Internet Engineering Task Force                               D. Wessels
Internet-Draft                                                        W.
                                                                 Carroll
Intended status: Standards Track                               M. Thomas
Expires: 17 July 2022                                           Verisign
                                                         13 January 2022


              Negative Caching of DNS Resolution Failures
           draft-dwmtwc-dnsop-caching-resolution-failures-00

Abstract

   In the DNS, resolvers employ caching to reduce both latency for end
   users and load on authoritative name servers. The process of
   resolution may result in one of three types of responses: (1) a
   response containing the requested data; (2) a response indicating the
   requested data does not exist; or (3) a non-response due to a
   resolution failure in which the resolver does not receive any useful
   information regarding the data's existence. This document concerns
   itself only with the third type.

   RFC 2308 specifies requirements for DNS negative caching. There,
   caching of type (1) and (2) responses is mandatory and caching of
   type (3) responses is optional. This document updates RFC 2308 to
   require negative caching for DNS resolution failures.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 17 July 2022.

Copyright Notice

   Copyright (c) 2022 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Revised BSD License text as described in Section 4.e of the
   Trust Legal Provisions and are provided without warranty as described
   in the Revised BSD License.

Table of Contents

   1.  Introduction
     1.1.  Motivation
     1.2.  Related Work
     1.3.  Terminology
   2.  Types of DNS Resolution Failures
     2.1.  Server Failure
     2.2.  Refused Response Code
     2.3.  Timeouts
     2.4.  Delegation Loops
     2.5.  Alias Loops
     2.6.  DNSSEC Validation Failures
   3.  DNS Negative Caching Requirements
     3.1.  Retries and Timeouts
     3.2.  TTLs
     3.3.  Scope
     3.4.  Requerying Delegation Information
   4.  IANA Considerations
   5.  Security Considerations
   6.  Privacy Considerations
   7.  Acknowledgments
   8.  Change Log
   9.  References
     9.1.  Normative References
     9.2.  Informative References
   Authors' Addresses

1.  Introduction

   Caching has always been a fundamental component of DNS resolution on
   the Internet. For example, [RFC0882] states:

      "The sheer size of the database and frequency of updates suggest
      that it must be maintained in a distributed manner, with local
      caching to improve performance."

   The early DNS RFCs ([RFC0882], [RFC0883], [RFC1034], and [RFC1035])
   primarily discuss caching in the context of what [RFC2308] calls
   "positive" responses, that is, when the response includes the
   requested data. In this case, a TTL is associated with each resource
   record in the response. Resolvers can cache and reuse the data until
   the TTL expires.

   Section 4.3.4 of [RFC1034] describes negative response caching, but
   notes it is optional and only talks about name errors (NXDOMAIN).
   This is the origin of using the SOA MINIMUM field as a negative
   caching TTL.

   [RFC2308] updated [RFC1034] to specify new requirements for DNS
   negative caching, including making it mandatory for name error
   (NXDOMAIN) and no data responses. It further specified optional
   negative caching for two DNS resolution failure cases: server failure
   and dead / unreachable servers.

   FOR DISCUSSION: RFC 2308 seems to use RFC 2119 keywords somewhat
   inconsistently when it comes to requirements for negative caching of
   type (1) and (2) responses. For example:

   *  Abstract: "negative caching should no longer be seen as an
      optional part of..."

   *  Section 5: "A negative answer that resulted from a name error
      (NXDOMAIN) should be cached..."

   *  Section 5: "A negative answer that resulted from a no data error
      (NODATA) should be cached..."

   *  Section 8: "Negative caching in resolvers is no-longer optional,
      if a resolver caches anything it must also cache negative
      answers."

   This document updates [RFC2308] to require negative caching of DNS
   resolution failures, and provides additional examples of resolution
   failures.

1.1.  Motivation

   Operators of DNS services have known for some time that recursive
   resolvers become more aggressive when they experience resolution
   failures. A number of different anecdotes, experiments, and
   incidents support this claim.

   [The authors vaguely recall stories of a moderately popular DNSBL
   that wanted to shut down, but found that not responding or REFUSED
   caused an overwhelming amount of traffic. Are there any citable
   references to this happening?]

   In December 2009, a secondary server for a number of in-addr.arpa
   subdomains saw its traffic suddenly double, and queries of type
   DNSKEY in particular increase by approximately two orders of
   magnitude, coinciding with a DNSSEC key rollover by the zone operator
   [roll-over-and-die]. This predated a signed root zone, and an
   operating system vendor was providing non-root trust anchors to the
   recursive resolver, which became out-of-date following the rollover.
   Unable to validate responses for the affected in-addr.arpa zones,
   recursive resolvers aggressively retried their queries.

   In 2016, the Internet infrastructure company Dyn experienced a large
   attack that impacted many high-profile customers. As documented in a
   technical presentation detailing the attack [dyn-attack], Dyn staff
   wrote: "At this point we are now experiencing botnet attack traffic
   and what is best classified as a 'retry storm'. Looking at certain
   large recursive platforms > 10x normal volume."

   In 2018, the root zone key signing key (KSK) was rolled over
   [root-ksk-roll].
   Throughout the rollover period, the root servers
   experienced a significant increase in DNSKEY queries. Before the
   rollover, a.root-servers.net and j.root-servers.net together received
   about 15 million DNSKEY queries per day. At the end of the
   revocation period, they received 1.2 billion per day, an 80x
   increase. Removal of the revoked key from the zone caused DNSKEY
   queries to drop to post-rollover but pre-revoke levels, indicating
   there is still a population of recursive resolvers using the previous
   root trust anchor and aggressively retrying DNSKEY queries.

   In 2021, Verisign researchers used botnet query traffic to
   demonstrate that certain large, public recursive DNS services exhibit
   very high query rates when all authoritative name servers for a zone
   return REFUSED or SERVFAIL [botnet]. When configured normally, query
   rates for a single botnet domain averaged approximately 50 queries
   per second. However, when configured to return SERVFAIL, the query
   rate increased to 60,000 per second. Furthermore, increases were
   also observed at the Root and TLD levels, even though delegations at
   those levels were unchanged and continued operating normally.

   Later that same year, on October 4, Facebook experienced a widespread
   and well-publicized outage [fb-outage]. During the 6-hour outage,
   none of Facebook's authoritative name servers were reachable, and
   none responded to queries. Recursive name servers attempting to
   resolve Facebook domains experienced timeouts. During this time,
   query traffic on the .COM/.NET infrastructure increased from 7,000 to
   900,000 queries per second [CITATION NEEDED].

1.2.  Related Work

   [RFC2308] describes negative caching for four types of DNS queries
   and responses: name errors, no data, server failures, and dead /
   unreachable servers.
   It places the strongest requirements on
   negative caching for name errors and no data responses, while server
   failures and dead servers are left as optional.

   [RFC4697] is a Best Current Practice that documents observed
   resolution misbehaviors. It describes a number of situations that
   can lead to excessive queries from recursive resolvers, including
   requerying for delegation data, lame servers, responses blocked by
   firewalls, and records with zero TTL. [RFC4697] makes a number of
   recommendations, varying from "SHOULD" to "MUST."

   An expired Internet Draft describes "The DNS thundering herd problem"
   [thundering-herd] as a situation arising when cached data expires at
   the same time for a large number of users. Although that document is
   not focused on negative caching, it does describe the benefits of
   combining multiple, identical queries to upstream name servers. That
   is, when a recursive resolver receives multiple queries for the same
   name, class, and type that cannot be answered from cached data, it
   should combine or join them into a single upstream query, rather than
   emit repeated, identical upstream queries.

   [RFC5452], "Measures for Making DNS More Resilient against Forged
   Answers," includes a section that describes the phenomenon known as
   birthday attacks. Here, again, the problem arises when a recursive
   resolver emits multiple, identical upstream queries. Multiple
   outstanding queries make it easier for an attacker to guess and
   correctly match some of the DNS message parameters, such as the port
   number and ID field. This situation is only exacerbated in the case
   of timeout-based resolution failures. DNSSEC, of course, is a
   suitable defense against spoofing attacks.

   [RFC8767] describes "Serving Stale Data to Improve DNS Resiliency."
   This permits a recursive resolver to return possibly stale data when
   it is unable to refresh cached, expired data.
   It introduces the idea
   of a failure recheck timer and says: "Attempts to refresh from non-
   responsive or otherwise failing authoritative nameservers are
   recommended to be done no more frequently than every 30 seconds."

1.3.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in BCP
   14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

   The terms Private Use, Reserved, Unassigned, and Specification
   Required are to be interpreted as defined in [RFC8126].

2.  Types of DNS Resolution Failures

   This section describes a number of different resolution failure
   conditions.

2.1.  Server Failure

   Server failure is defined in [RFC1035] as: "The name server was
   unable to process this query due to a problem with the name server."
   A server failure is signaled by setting the RCODE field to SERVFAIL.

   Authoritative servers, and more specifically secondary servers,
   return server failure responses when they don't have any valid data
   for a zone. That is, a secondary server has been configured to serve
   a particular zone, but is unable to retrieve or refresh the zone data
   from the primary server.

   Recursive servers return server failure in response to a number of
   different conditions, including many described below.

2.2.  Refused Response Code

   A name server returns a message with the RCODE field set to REFUSED
   when it refuses to process the query for policy reasons.

   Authoritative servers generally return REFUSED when processing a
   query for which they are not authoritative. For example, a server
   that is configured to be authoritative for only the EXAMPLE.NET zone
   may return REFUSED in response to a query for EXAMPLE.COM.
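
   The authoritative-server behavior in the example above can be
   sketched in a few lines of Python. This is only an illustration, not
   a description of any particular implementation: the function names
   and the list of configured zones are hypothetical, and NOERROR
   stands in for "process the query normally." The numeric RCODE values
   are those assigned in [RFC1035].

```python
NOERROR = 0  # RCODE 0; here a placeholder for "answer the query normally"
REFUSED = 5  # RCODE 5, as assigned in RFC 1035


def labels(name):
    """Split a domain name into lowercase labels, ignoring a trailing dot."""
    return [label for label in name.lower().rstrip(".").split(".") if label]


def is_within_zone(qname, zone):
    """Return True if qname is the zone apex or a name below it.

    The comparison is label-wise, so NOTEXAMPLE.NET is not treated as
    part of EXAMPLE.NET.
    """
    q, z = labels(qname), labels(zone)
    return len(q) >= len(z) and q[len(q) - len(z):] == z


def rcode_for_query(qname, configured_zones):
    """Hypothetical policy: REFUSED unless qname is in a configured zone."""
    if any(is_within_zone(qname, zone) for zone in configured_zones):
        return NOERROR
    return REFUSED
```

   With only EXAMPLE.NET configured, a query for EXAMPLE.COM yields
   REFUSED, while a query for a name under EXAMPLE.NET is processed.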

   Recursive servers generally return REFUSED for query sources that do
   not match configured access control lists. For example, a server
   that is configured to allow queries from only 2001:DB8:1::/48 may
   return REFUSED in response to a query from 2001:DB8:5::1.

2.3.  Timeouts

   A timeout occurs when a resolver fails to receive any response from a
   server within a reasonable amount of time. [RFC2308] refers to this
   as a "dead / unreachable server."

   Note that resolver implementations may have two types of timeouts: a
   smaller timeout which might trigger a query retry and a larger
   timeout after which the server is considered unresponsive.

   Timeouts can present a particular problem for negative caching,
   depending on how the resolver handles multiple, outstanding queries
   for the same <query name, type, class> tuple. For example, consider
   a very popular website in a zone whose name servers are all
   unresponsive. A recursive resolver might receive tens or hundreds of
   queries per second for the popular website. If the recursive server
   implementation "joins" these outstanding queries together, then it
   only sends one recursive-to-authoritative query for the numerous
   pending stub-to-recursive queries. If, however, the implementation
   does not join outstanding queries together, then it sends one
   recursive-to-authoritative query for each stub-to-recursive query.
   If the incoming query rate is high and the timeout is large, this
   might result in hundreds or thousands of recursive-to-authoritative
   queries while waiting for an authoritative server to time out.

2.4.  Delegation Loops

   A delegation loop, or cycle, can occur when one domain utilizes name
   servers in a second domain, and the second domain uses name servers
   in the first. For example:

      FOO.EXAMPLE.    NS      NS1.EXAMPLE.COM.
      FOO.EXAMPLE.    NS      NS2.EXAMPLE.COM.

      EXAMPLE.COM.    NS      NS1.FOO.EXAMPLE.
      EXAMPLE.COM.    NS      NS2.FOO.EXAMPLE.

   In this example, no names under FOO.EXAMPLE or EXAMPLE.COM can be
   resolved because of the delegation loop. Note that a delegation loop
   may involve more than two domains. A resolver that does not detect
   delegation loops may generate DDoS levels of attack traffic to
   authoritative name servers, as documented in the TsuNAME
   vulnerability [TsuNAME].

2.5.  Alias Loops

   An alias loop, or cycle, can occur when one CNAME or DNAME RR refers
   to a second name, which in turn is specified as an alias for the
   first. For example:

      APP.FOO.EXAMPLE.        CNAME   APP.EXAMPLE.NET.
      APP.EXAMPLE.NET.        CNAME   APP.FOO.EXAMPLE.

   The need to detect CNAME loops has been known since at least
   [RFC1034], which states in Section 3.6.2:

      "Of course, by the robustness principle, domain software should
      not fail when presented with CNAME chains or loops; CNAME chains
      should be followed and CNAME loops signaled as an error."

2.6.  DNSSEC Validation Failures

   Negative caching of DNSSEC validation errors is described in Section
   4.7 of [RFC4035].

   FOR DISCUSSION: RFC 4035 says "resolvers MAY cache data with invalid
   signatures" while in this document all resolution failures MUST be
   negatively cached. The focus of RFC 4035 seems to be on caching bad
   *data* rather than caching a more general resolution failure (e.g.,
   inability to retrieve keys).

3.  DNS Negative Caching Requirements

3.1.  Retries and Timeouts

   A resolver MUST NOT retry more than twice (i.e., three queries in
   total) before considering a server unresponsive.

   This document does not place any requirements on timeout values,
   which may be implementation- or configuration-dependent. It is
   generally expected that typical timeout values range from 3 to 30
   seconds.

3.2.  TTLs

   Resolvers MUST cache resolution failures for at least 5 seconds.
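
   As a rough illustration of the TTL rules in this section (a 5-second
   minimum, doubling after each repeated failure, and a 5-minute
   ceiling), the following Python sketch computes the negative cache
   TTL and tracks failures per cache key. The names failure_ttl and
   FailureCache, the injected clock, and the choice to reset the
   backoff after a successful resolution are assumptions made for this
   example; they are not specified by this document. The cache key is
   treated as an opaque hashable value.

```python
import time

MIN_FAILURE_TTL = 5    # seconds; the minimum required by this section
MAX_FAILURE_TTL = 300  # seconds; the 5-minute ceiling


def failure_ttl(consecutive_failures):
    """TTL for the Nth consecutive failure (1-based): 5, 10, 20, ... up to 300."""
    if consecutive_failures < 1:
        raise ValueError("consecutive_failures must be at least 1")
    # Bound the exponent so very long failure streaks do not build huge integers.
    exponent = min(consecutive_failures - 1, 16)
    return min(MIN_FAILURE_TTL * 2 ** exponent, MAX_FAILURE_TTL)


class FailureCache:
    """Hypothetical cache of resolution failures with exponential backoff."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # key -> (expiry time, consecutive failure count)

    def record_failure(self, key):
        """Store or extend a failure entry and return the TTL applied."""
        _, count = self._entries.get(key, (0.0, 0))
        count += 1
        ttl = failure_ttl(count)
        self._entries[key] = (self._clock() + ttl, count)
        return ttl

    def is_suppressed(self, key):
        """True while a cached failure for key has not yet expired."""
        entry = self._entries.get(key)
        return entry is not None and self._clock() < entry[0]

    def record_success(self, key):
        """Assumption: a successful resolution clears the backoff state."""
        self._entries.pop(key, None)
```

   Expired entries are kept so that the failure count survives until
   the next attempt; a real cache would also need to evict old entries
   to bound memory use.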

   Resolvers SHOULD employ an exponential backoff algorithm to increase
   the amount of time for subsequent resolution failures. For example,
   the initial negative cache TTL is set to 5 seconds. The TTL is
   doubled after each retry that results in another resolution failure.
   Consistent with [RFC2308], resolution failures MUST NOT be cached for
   longer than 5 minutes.

3.3.  Scope

   Resolution failures MUST be cached against the specific query tuple
   <query name, type, class, server IP address>.

   It is common for resolvers to have multiple servers from which to
   choose for a particular query. For example, in the case of stub-to-
   recursive, the stub resolver may be configured with multiple resolver
   addresses. In the case of recursive-to-authoritative, a given zone
   usually has more than one name server (NS record), each of which can
   have multiple IP addresses.

   Nothing in this document prevents a resolver from retrying a query at
   a different server. However, if all known servers for a query tuple
   <query name, type, class> return server failures, the resolver MUST
   NOT send further queries for the tuple until the corresponding
   negative cache entries expire.

3.4.  Requerying Delegation Information

   Quoting from [RFC4697]:

      There can be times when every name server in a zone's NS RRSet is
      unreachable (e.g., during a network outage), unavailable (e.g.,
      the name server process is not running on the server host), or
      misconfigured (e.g., the name server is not authoritative for the
      given zone, also known as "lame").

   This document reiterates the requirement from Section 2.1.1 of
   [RFC4697]:

      An iterative resolver MUST NOT send a query for the NS RRSet of a
      non-responsive zone to any of the name servers for that zone's
      parent zone. For the purposes of this injunction, a non-
      responsive zone is defined as a zone for which every name server
      listed in the zone's NS RRSet:

      1.  is not authoritative for the zone (i.e., lame), or

      2.
          returns a server failure response (RCODE=2), or

      3.  is dead or unreachable according to Section 7.2 of [RFC2308].

   FOR DISCUSSION: the requirement quoted above may be problematic
   today. For example, focusing on NS as the query type (a) probably
   goes against QNAME minimization and (b) is not the real problem.
   Also, RFC 4697 doesn't place any time restriction (TTL) on this.

4.  IANA Considerations

   None

5.  Security Considerations

   This is intended to improve security.

   Future work: Think about if/how new requirements could be abused,
   used for DoS.

6.  Privacy Considerations

   This specification has no impact on user privacy.

7.  Acknowledgments

   The authors wish to thank ...

8.  Change Log

   RFC Editor: Please remove this section before publication.

   This section lists substantial changes to the document as it is being
   worked on.

9.  References

9.1.  Normative References

   [RFC1034]  Mockapetris, P., "Domain names - concepts and facilities",
              STD 13, RFC 1034, DOI 10.17487/RFC1034, November 1987,
              <https://www.rfc-editor.org/info/rfc1034>.

   [RFC1035]  Mockapetris, P., "Domain names - implementation and
              specification", STD 13, RFC 1035, DOI 10.17487/RFC1035,
              November 1987, <https://www.rfc-editor.org/info/rfc1035>.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC2308]  Andrews, M., "Negative Caching of DNS Queries (DNS
              NCACHE)", RFC 2308, DOI 10.17487/RFC2308, March 1998,
              <https://www.rfc-editor.org/info/rfc2308>.

   [RFC4697]  Larson, M. and P. Barber, "Observed DNS Resolution
              Misbehavior", BCP 123, RFC 4697, DOI 10.17487/RFC4697,
              October 2006, <https://www.rfc-editor.org/info/rfc4697>.

   [RFC8126]  Cotton, M., Leiba, B., and T. Narten, "Guidelines for
              Writing an IANA Considerations Section in RFCs", BCP 26,
              RFC 8126, DOI 10.17487/RFC8126, June 2017,
              <https://www.rfc-editor.org/info/rfc8126>.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017, <https://www.rfc-editor.org/info/rfc8174>.

9.2.
Informative References

   [botnet]   Wessels, D. and M. Thomas, "Botnet Traffic Observed at
              Various Levels of the DNS Hierarchy", May 2021.

   [dyn-attack]
              Sullivan, A., "Dyn, DDoS, and DNS", March 2017.

   [fb-outage]
              Janardhan, S., "More details about the October 4 outage",
              October 2021.

   [RFC0882]  Mockapetris, P., "Domain names: Concepts and facilities",
              RFC 882, DOI 10.17487/RFC0882, November 1983,
              <https://www.rfc-editor.org/info/rfc882>.

   [RFC0883]  Mockapetris, P., "Domain names: Implementation
              specification", RFC 883, DOI 10.17487/RFC0883, November
              1983, <https://www.rfc-editor.org/info/rfc883>.

   [RFC4035]  Arends, R., Austein, R., Larson, M., Massey, D., and S.
              Rose, "Protocol Modifications for the DNS Security
              Extensions", RFC 4035, DOI 10.17487/RFC4035, March 2005,
              <https://www.rfc-editor.org/info/rfc4035>.

   [RFC5452]  Hubert, A. and R. van Mook, "Measures for Making DNS More
              Resilient against Forged Answers", RFC 5452,
              DOI 10.17487/RFC5452, January 2009,
              <https://www.rfc-editor.org/info/rfc5452>.

   [RFC8767]  Lawrence, D., Kumari, W., and P. Sood, "Serving Stale Data
              to Improve DNS Resiliency", RFC 8767,
              DOI 10.17487/RFC8767, March 2020,
              <https://www.rfc-editor.org/info/rfc8767>.

   [roll-over-and-die]
              Michaelson, G., Wallström, P., Arends, R., and G. Huston,
              "Roll Over and Die?", February 2010.

   [root-ksk-roll]
              Müller, M., Thomas, M., Wessels, D., Hardaker, W., Chung,
              T., Toorop, W., and R. van Rijswijk-Deij, "Roll, Roll,
              Roll Your Root: A Comprehensive Analysis of the First Ever
              DNSSEC Root KSK Rollover", October 2019.

   [thundering-herd]
              Sivaraman, M. and C. Liu, "The DNS thundering herd problem
              (expired Internet Draft)", June 2020.

   [TsuNAME]  Moura, G. C. M., Castro, S., Heidemann, J., and W.
              Hardaker, "TsuNAME: exploiting misconfiguration and
              vulnerability to DDoS DNS", November 2021.

Authors' Addresses

   Duane Wessels
   Verisign
   12061 Bluemont Way
   Reston

   Phone: +1 703 948-3200
   Email: dwessels@verisign.com
   URI:   https://verisign.com


   William Carroll
   Verisign
   12061 Bluemont Way
   Reston

   Phone: +1 703 948-3200
   Email: wicarroll@verisign.com
   URI:   https://verisign.com


   Matthew Thomas
   Verisign
   12061 Bluemont Way
   Reston

   Phone: +1 703 948-3200
   Email: mthomas@verisign.com
   URI:   https://verisign.com