idnits 2.17.1

draft-tale-dnsop-serve-stale-01.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  -- The draft header indicates that this document updates RFC1034, but the
     abstract doesn't seem to mention this, which it should.

  -- The draft header indicates that this document updates RFC1035, but the
     abstract doesn't seem to mention this, which it should.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

     (Using the creation date from RFC1034, updated by this document, for
     RFC5378 checks: 1987-11-01)

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer.
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (June 27, 2017) is 2487 days in the past.  Is this
     intentional?

  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Obsolete informational reference (is this intentional?): RFC 7719
     (Obsoleted by RFC 8499)

     Summary: 0 errors (**), 0 flaws (~~), 1 warning (==), 5 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

DNSOP Working Group                                          D. Lawrence
Internet-Draft                                       Akamai Technologies
Updates: 1034, 1035 (if approved)                              W. Kumari
Intended status: Standards Track                                  Google
Expires: December 29, 2017                                 June 27, 2017

              Serving Stale Data to Improve DNS Resiliency
                    draft-tale-dnsop-serve-stale-01

Abstract

   This draft defines a method for recursive resolvers to use stale DNS
   data to avoid outages when authoritative nameservers cannot be
   reached to refresh expired data.

Ed note

   Text inside square brackets ([]) is additional background
   information, answers to frequently asked questions, general musings,
   etc.  They will be removed before publication.  This document is
   being collaborated on in GitHub at .  The most recent version of the
   document, open issues, etc should all be available here.  The authors
   gratefully accept pull requests.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.
   It is inappropriate to use Internet-Drafts as reference material or
   to cite them other than as "work in progress."

   This Internet-Draft will expire on December 29, 2017.

Copyright Notice

   Copyright (c) 2017 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Terminology
   3.  Description
   4.  Implementation Caveats
   5.  Implementation Status
   6.  Security Considerations
   7.  Privacy Considerations
   8.  NAT Considerations
   9.  IANA Considerations
   10. Acknowledgements
   11. References
     11.1.  Normative References
     11.2.  Informative References
   Authors' Addresses

1.  Introduction

   Traditionally the Time To Live (TTL) of a DNS resource record has
   been understood to represent the maximum number of seconds that a
   record can be used before it must be discarded, based on its
   description and usage in [RFC1035] and clarifications in [RFC2181].
   Specifically, [RFC1035] Section 3.2.1 says that it "specifies the
   time interval that the resource record may be cached before the
   source of the information should again be consulted".

   Notably, the original DNS specification does not say that data past
   its expiration cannot be used.  This document proposes a method for
   how recursive resolvers should handle stale DNS data to balance the
   competing needs of resiliency and freshness.  It is predicated on the
   observation that authoritative server unavailability can cause
   outages even when the underlying data those servers would return is
   typically unchanged.

   There are a number of reasons why an authoritative server may become
   unreachable, including Denial of Service (DoS) attacks, network
   issues, and so on.  This document suggests that, if the recursive
   server is unable to contact the authoritative server but still has
   data for the query name, it essentially extends the TTL of the
   existing data on the assumption that "stale bread is better than no
   bread".

   Several major recursive resolver operations currently use stale data
   for answers in some way, including Akamai, OpenDNS, and Xerocole.

2.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

   For a comprehensive treatment of DNS terms, please see [RFC7719].

3.  Description

   Three notable timers drive considerations for the use of stale data,
   as follows:

   o  A client response timer, which is the maximum amount of time a
      recursive resolver should allow between the receipt of a
      resolution request and sending its response.

   o  A query resolution timer, which caps the total amount of time a
      recursive resolver spends processing the query.

   o  A maximum stale timer, which caps the amount of time that records
      will be kept past their expiration.

   Recursive resolvers already have the second timer; the first and
   third timers are new concepts for this mechanism.

   When a request is received by the recursive resolver, it SHOULD start
   the client response timer.  This timer is used to avoid client
   timeouts.  It SHOULD be configurable, with a recommended value of 1.8
   seconds.

   The resolver then checks its cache for an unexpired answer.  If it
   finds none and the Recursion Desired flag is not set in the request,
   it SHOULD immediately return the response without consulting the
   cache for expired records.

   If iterative lookups will be done, it SHOULD start the query
   resolution timer.  This timer bounds the work done by the resolver,
   and is commonly around 10 to 30 seconds.  [ BIND 9 used to use a
   hard-coded constant of 30 seconds and has more recently added a
   configuration parameter that defaults to 10 seconds and is capped at
   30.  A rigorous exploration of other implementations has not yet been
   done. ]

   If the answer has not been completely determined by the time the
   client response timer has elapsed, the resolver SHOULD then check its
   cache to see whether there is expired data that would satisfy the
   request.  If so, it adds that data to the response message and SHOULD
   set the TTL of each expired record in the message to 1 second.  The
   response is then sent to the client while the resolver continues its
   attempt to refresh the data.
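
   As a non-normative illustration, the cache decision described above
   might be sketched as follows.  This is only a sketch under stated
   assumptions: the CachedRecord type, the cache_answer function, and
   the use of a simple elapsed-time argument are hypothetical and are
   not taken from any implementation.  The 7 day maximum stale value is
   the one suggested elsewhere in this section.  Handling of the
   Recursion Desired flag and the refresh attempt itself are omitted.

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple

CLIENT_RESPONSE_TIMER = 1.8          # seconds; recommended value in the text
MAX_STALE_TIMER = 7 * 24 * 60 * 60   # seconds; the draft's suggested 7 days
STALE_ANSWER_TTL = 1                 # seconds; TTL set on expired records


@dataclass
class CachedRecord:          # hypothetical cache entry, for illustration only
    rdata: str
    expires_at: float        # absolute time at which the received TTL runs out


def cache_answer(record: Optional[CachedRecord], now: float,
                 waited: float) -> Optional[Tuple[str, int]]:
    """Return (rdata, ttl) if the cache may answer now, else None.

    `waited` is how long the request has been pending, in seconds.
    """
    if record is None:
        return None
    if now < record.expires_at:
        # Unexpired data is always usable; report the remaining TTL.
        return record.rdata, math.ceil(record.expires_at - now)
    if waited < CLIENT_RESPONSE_TIMER:
        # Give resolution a chance to succeed before falling back.
        return None
    if now - record.expires_at > MAX_STALE_TIMER:
        # Past the maximum stale timer; assume the record is no longer kept.
        return None
    # Serve the expired data with a 1 second TTL. The caller is expected
    # to keep trying to refresh it until the query resolution timer ends.
    return record.rdata, STALE_ANSWER_TTL
```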

   A TTL of 1 second was chosen because historically 0 second TTLs have
   been problematic for some implementations.  It not only sidesteps
   those potential problems with no practical negative consequence, it
   would also rate limit further queries from any client that is
   honoring the TTL, such as a forwarding resolver.

   The maximum stale timer is used for cache management and is
   independent of the query resolution process.  This timer is
   conceptually different from the maximum cache TTL that exists in many
   resolvers, the latter being a clamp on the value of TTLs as received
   from authoritative servers.  The maximum stale timer SHOULD be
   configurable, and defines the length of time after a record expires
   that it SHOULD be retained in the cache.  The suggested value is 7
   days, which gives time for the resolution problem to be noticed and
   for human intervention to fix it.

   This same basic technique MAY be used to handle stale data associated
   with delegations.  If authoritative server addresses are not able to
   be refreshed, resolution can possibly still be successful if the
   authoritative servers themselves are still up.

4.  Implementation Caveats

   Answers from authoritative servers that have a DNS Response Code of
   either 0 (NOERROR) or 3 (NXDOMAIN) MUST be considered to have
   refreshed the data at the resolver.  In particular, this means that
   this method is not meant to protect against operator error at the
   authoritative server that turns a name that is intended to be valid
   into one that is non-existent, because there is no way for a resolver
   to know intent.

   [ Paul Vixie has suggested that it be made explicit that an auth
   NXDOMAIN cause all data, even stale data, below the NXDOMAIN to also
   be removed, a la
   https://datatracker.ietf.org/doc/draft-vixie-dnsext-resimprove/.
   Conceptually this certainly has its appeal but addressing it in this
   document when resimprove has not progressed has procedural problems.
   This paragraph will be removed in the next draft, either dropping the
   idea here completely or blessing it based on positive feedback to do
   so. ]

   Resolution is given a chance to succeed before stale data is used, to
   adhere to the original intent of the design of the DNS.  This
   mechanism is only intended to add robustness to failures, and to be
   enabled all the time.  If stale data were used immediately and then a
   cache refresh attempted after the client response has been sent, the
   resolver would frequently be sending data that it would have had no
   trouble refreshing.

   It is important to continue the resolution attempt after the stale
   response has been sent, until the query resolution timeout, because
   some pathological resolutions can take many seconds to succeed as
   they cope with unavailable servers, bad networks, and other problems.
   Stopping the resolution attempt when the response with expired data
   has been sent would mean that answers in these pathological cases
   would never be refreshed.

   Canonical Name (CNAME) records mingled in the expired cache with
   other records at the same owner name can cause surprising results.
   This was observed with an initial implementation in BIND, where a
   hostname changed from having an IPv4 Address (A) record to a CNAME.
   The version of BIND being used did not evict other types in the cache
   when a CNAME was received, which in normal operations is not a
   significant issue.  However, after both records expired and the
   authorities became unavailable, the fallback to stale answers
   returned the older A instead of the newer CNAME.

   [ This probably applies to other occluding types, so more thought
   should be given to the overall issue.
   ]

   Keeping records around after their normal expiration will of course
   cause caches to grow larger than if records were removed at their
   TTL.  Specific guidance on managing cache sizes is outside the scope
   of this document.  Some areas for consideration include whether to
   track the popularity of names in client requests versus evicting by
   maximum age, and whether to provide a feature for manually flushing
   only stale records.

5.  Implementation Status

   [RFC Editor: per RFC 6982 this section should be removed prior to
   publication.]

   The algorithm described in this draft was originally implemented as a
   patch to BIND 9.7.0.  It has been in production on Akamai's
   production network since 2011, and effectively smoothed over
   transient failures and longer outages that would have resulted in
   major incidents.  The patch has been contributed to the Internet
   Systems Consortium in anticipation that it will be incorporated into
   their main BIND distribution.

   Unbound has a similar feature for serving stale answers, but it works
   in a very different way by returning whatever cached answer it has
   before trying to refresh expired records.

6.  Security Considerations

   The most obvious security issue is the increased likelihood of DNSSEC
   validation failures when using stale data because signatures could be
   returned outside their validity period.  This would only be an issue
   if the authoritative servers are unreachable, the only time the
   techniques in this document are used, and thus does not introduce a
   new failure in place of what would have otherwise been success.

   Additionally, bad actors have been known to use DNS caches to keep
   records alive even after their authorities have gone away.  This
   makes that easier.

7.  Privacy Considerations

   This document does not add any practical new privacy issues.

8.  NAT Considerations

   The method described here is not affected by the use of NAT devices.

9.  IANA Considerations

   This document contains no actions for IANA.  This section will be
   removed during conversion into an RFC by the RFC editor.

10.  Acknowledgements

   The authors wish to thank Matti Klock, Mukund Sivaraman, Jean Roy,
   and Jason Moreau for initial review.

11.  References

11.1.  Normative References

   [RFC1035]  Mockapetris, P., "Domain names - implementation and
              specification", STD 13, RFC 1035, DOI 10.17487/RFC1035,
              November 1987, .

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997, .

   [RFC2181]  Elz, R. and R. Bush, "Clarifications to the DNS
              Specification", RFC 2181, DOI 10.17487/RFC2181, July 1997,
              .

11.2.  Informative References

   [RFC7719]  Hoffman, P., Sullivan, A., and K. Fujiwara, "DNS
              Terminology", RFC 7719, DOI 10.17487/RFC7719, December
              2015, .

Authors' Addresses

   David C Lawrence
   Akamai Technologies
   150 Broadway
   Cambridge  MA 02142-1054
   USA

   Email: tale@akamai.com

   Warren Kumari
   Google
   1600 Amphitheatre Parkway
   Mountain View  CA 94043
   USA

   Email: warren@kumari.net