DNSOP Working Group                                          D. Lawrence
Internet-Draft                                       Akamai Technologies
Intended status: Standards Track                               W. Kumari
Expires: September 14, 2017                                       Google
                                                          March 13, 2017


              Serving Stale Data to Improve DNS Resiliency
                    draft-tale-dnsop-serve-stale-00

Abstract

   This draft defines a method for recursive resolvers to use stale DNS
   data to avoid outages when authoritative nameservers cannot be
   reached to refresh expired data.

Ed note

   Text inside square brackets ([]) is additional background
   information, answers to frequently asked questions, general musings,
   etc. It will be removed before publication. This document is
   being collaborated on in GitHub at . The most recent version of the
   document, open issues, etc. should all be available here. The
   authors gratefully accept pull requests.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on September 14, 2017.

Copyright Notice

   Copyright (c) 2017 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1. Introduction
   2. Terminology
   3. Description
   4. Implementation Caveats
   5. Implementation Status
   6. Security Considerations
   7. Privacy Considerations
   8. NAT Considerations
   9. IANA Considerations
   10. Acknowledgements
   11. References
     11.1. Normative References
     11.2. Informative References
   Authors' Addresses

1. Introduction

   Traditionally the Time To Live (TTL) of a DNS resource record has
   been understood to represent the maximum number of seconds that a
   record can be used before it must be discarded, based on its
   description and usage in [RFC1035] and clarifications in [RFC2181].
   Specifically, [RFC1035] Section 3.2.1 says that it "specifies the
   time interval that the resource record may be cached before the
   source of the information should again be consulted".

   Notably, the original DNS specification does not say that data past
   its expiration cannot be used. This document proposes a method for
   how recursive resolvers should handle stale DNS data to balance the
   competing needs of resiliency and freshness. It is predicated on the
   observation that authoritative server unavailability can cause
   outages even when the underlying data those servers would return is
   typically unchanged.

   There are a number of reasons why an authoritative server may become
   unreachable, including Denial of Service (DoS) attacks, network
   issues, and so on.
   This document suggests that, if the recursive server is unable to
   contact the authoritative server but still has data for the query
   name, it essentially extends the TTL of the existing data on the
   assumption that "stale bread is better than no bread".

   Several major recursive resolver operations currently use stale data
   for answers in some way, including Akamai, OpenDNS, and Xerocole.

2. Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

   For a comprehensive treatment of DNS terms, please see [RFC7719].

3. Description

   Three notable timers drive considerations for the use of stale data,
   as follows:

   o  A client response timer, which is the maximum amount of time a
      recursive resolver should allow between the receipt of a
      resolution request and sending its response.

   o  A query resolution timer, which caps the total amount of time a
      recursive resolver spends processing the query.

   o  A maximum stale timer, which caps the amount of time that records
      will be kept past their expiration.

   Recursive resolvers already have the second timer; the first and
   third timers are new concepts for this mechanism.

   When a request is received by the recursive resolver, it SHOULD start
   the client response timer. This timer is used to avoid client
   timeouts. It SHOULD be configurable, with a recommended value of 1.8
   seconds.

   The resolver then checks its cache for an unexpired answer. If it
   finds none and the Recursion Desired flag is not set in the request,
   it SHOULD immediately return the response without consulting the
   cache for expired records.

   If iterative lookups will be done, it SHOULD start the query
   resolution timer. This timer bounds the work done by the resolver,
   and is commonly around 10 to 30 seconds.
   [ BIND 9 used to use a hard-coded constant of 30 seconds and has
   more recently added a configuration parameter that defaults to 10
   seconds and is capped at 30. A rigorous exploration of other
   implementations has not yet been done. ]

   If the answer has not been completely determined by the time the
   client response timer has elapsed, the resolver SHOULD then check its
   cache to see whether there is expired data that would satisfy the
   request. If so, it adds that data to the response message and SHOULD
   set the TTL of each expired record in the message to 1 second.
   [ This 1 second TTL is ripe for discussion. ] The response is then
   sent to the client while the resolver continues its attempt to
   refresh the data.

   The maximum stale timer is used for cache management and is
   independent of the query resolution process. This timer is
   conceptually different from the maximum cache TTL that exists in many
   resolvers, the latter being a clamp on the value of TTLs as received
   from authoritative servers. The maximum stale timer SHOULD be
   configurable, and defines the length of time after a record expires
   that it SHOULD be retained in the cache. The suggested value is 7
   days, which gives time for the resolution problem to be noticed and
   for human intervention to fix it.

   This same basic technique MAY be used to handle stale data associated
   with delegations. If authoritative server addresses cannot be
   refreshed, resolution can possibly still succeed if the authoritative
   servers themselves are still up.

4. Implementation Caveats

   Answers from authoritative servers that have a DNS Response Code of
   either 0 (NOERROR) or 3 (NXDOMAIN) MUST be considered to have
   refreshed the data at the resolver.
   In particular, this means that this method is not meant to protect
   against operator error at the authoritative server that turns a name
   that is intended to be valid into one that is non-existent, because
   there is no way for a resolver to know intent.

   To adhere to the original intent of the design of the DNS,
   resolution is given a chance to succeed before stale data is used.
   This mechanism is intended only to add robustness against failures,
   and is meant to be enabled all the time. If stale data were used
   immediately and a cache refresh attempted only after the client
   response had been sent, the resolver would frequently be sending
   data that it would have had no trouble refreshing.

   It is important to continue the resolution attempt after the stale
   response has been sent, until the query resolution timeout, because
   some pathological resolutions can take many seconds to succeed as
   they cope with unavailable servers, bad networks, and other problems.
   Stopping the resolution attempt when the response with expired data
   has been sent would mean that answers in these pathological cases
   would never be refreshed.

   Canonical Name (CNAME) records mingled in the expired cache with
   other records at the same owner name can cause surprising results.
   This was observed with an initial implementation in BIND, where a
   hostname changed from having a CNAME record to an IPv4 Address (A)
   record. BIND does not evict CNAMEs in the cache when other types are
   received, which in normal operations is not an issue. However, after
   both records expired and the authorities became unavailable, the
   fallback to stale answers returned the older CNAME instead of the
   newer A.

   [ This probably applies to other occluding types, so more thought
   should be given to the overall issue. It should probably also be
   rewritten to not suggest that this is only a quirk of BIND.
   ]

   Keeping records around after their normal expiration will of course
   cause caches to grow larger than if records were removed at their
   TTL. Specific guidance on managing cache sizes is outside the scope
   of this document. Some areas for consideration include whether to
   track the popularity of names in client requests versus evicting by
   maximum age, and whether to provide a feature for manually flushing
   only stale records.

5. Implementation Status

   [RFC Editor: per RFC 6982 this section should be removed prior to
   publication.]

   The algorithm described in this draft was originally implemented as a
   patch to BIND 9.7.0. It has been in use on Akamai's production
   network since 2011, and has effectively smoothed over transient
   failures and longer outages that would otherwise have resulted in
   major incidents. The patch has been contributed to the Internet
   Systems Consortium in anticipation that it will be incorporated into
   their main BIND distribution.

6. Security Considerations

   The most obvious security issue is the increased likelihood of DNSSEC
   validation failures when using stale data because signatures could be
   returned outside their validity period. This would only be an issue
   if the authoritative servers are unreachable, the only time the
   techniques in this document are used, and thus does not introduce a
   new failure in place of what would have otherwise been success.

   Additionally, bad actors have been known to use DNS caches to keep
   records alive even after their authorities have gone away. The
   mechanism described in this document makes that easier.

7. Privacy Considerations

   This document does not add any practical new privacy issues.

8. NAT Considerations

   The method described here is not affected by the use of NAT devices.

9. IANA Considerations

   This document contains no actions for IANA. This section will be
   removed during conversion into an RFC by the RFC editor.

10.
Acknowledgements

   The authors wish to thank Matti Klock, Mukund Sivaraman, Jean Roy,
   and Jason Moreau for initial review.

11. References

11.1. Normative References

   [RFC1035]  Mockapetris, P., "Domain names - implementation and
              specification", STD 13, RFC 1035, DOI 10.17487/RFC1035,
              November 1987.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC2181]  Elz, R. and R. Bush, "Clarifications to the DNS
              Specification", RFC 2181, DOI 10.17487/RFC2181, July 1997.

11.2. Informative References

   [RFC7719]  Hoffman, P., Sullivan, A., and K. Fujiwara, "DNS
              Terminology", RFC 7719, DOI 10.17487/RFC7719, December
              2015.

Authors' Addresses

   David C Lawrence
   Akamai Technologies
   150 Broadway
   Cambridge MA 02142-1054
   USA

   Email: tale@akamai.com


   Warren Kumari
   Google
   1600 Amphitheatre Parkway
   Mountain View CA 94043
   USA

   Email: warren@kumari.net
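
   [ As an informal illustration only, the lookup order described in
   Section 3 and the refresh rule in Section 4 can be sketched in
   Python roughly as follows. This is not taken from any
   implementation. The names StaleCache, Answer, and respond are
   hypothetical, the timers are not actually run (the "upstream"
   argument stands for whatever iterative resolution had produced by
   the time the client response timer elapsed, or None if it was still
   pending), and records are reduced to a single string per key. ]

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, NamedTuple, Optional, Tuple

# Values quoted from Section 3; the constant names are illustrative only.
CLIENT_RESPONSE_TIMER = 1.8            # seconds (recommended; not modelled below)
MAX_STALE_TIMER = 7 * 24 * 3600        # seconds (suggested value of 7 days)
STALE_ANSWER_TTL = 1                   # seconds (TTL placed on expired records)
REFRESHING_RCODES = frozenset({0, 3})  # NOERROR and NXDOMAIN (Section 4)


class Answer(NamedTuple):
    rcode: int
    rdata: Optional[str]
    ttl: int
    stale: bool


@dataclass
class _Entry:
    rcode: int
    rdata: Optional[str]
    expires_at: float  # moment at which the original TTL runs out


class StaleCache:
    """Hypothetical cache that keeps records past expiry, up to max_stale."""

    def __init__(self, clock: Callable[[], float] = time.monotonic,
                 max_stale: float = MAX_STALE_TIMER) -> None:
        self._clock = clock
        self._max_stale = max_stale
        self._entries: Dict[str, _Entry] = {}

    def store(self, key: str, rcode: int, rdata: Optional[str], ttl: int) -> bool:
        # Only NOERROR and NXDOMAIN answers count as refreshing the data.
        if rcode not in REFRESHING_RCODES:
            return False
        self._entries[key] = _Entry(rcode, rdata, self._clock() + ttl)
        return True

    def lookup(self, key: str, allow_stale: bool = False) -> Optional[Answer]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        now = self._clock()
        if now < entry.expires_at:
            return Answer(entry.rcode, entry.rdata, int(entry.expires_at - now), False)
        if now - entry.expires_at > self._max_stale:
            del self._entries[key]  # past the maximum stale timer: evict
            return None
        if allow_stale:
            return Answer(entry.rcode, entry.rdata, STALE_ANSWER_TTL, True)
        return None


def respond(cache: StaleCache, key: str, recursion_desired: bool,
            upstream: Optional[Tuple[int, Optional[str], int]] = None) -> Optional[Answer]:
    """Choose the answer to send once the client response timer has elapsed."""
    cached = cache.lookup(key)
    if cached is not None:
        return cached              # unexpired data is always preferred
    if not recursion_desired:
        return None                # RD clear: expired records are not consulted
    if upstream is not None:
        rcode, rdata, ttl = upstream
        if cache.store(key, rcode, rdata, ttl):
            return Answer(rcode, rdata, ttl, False)
    return cache.lookup(key, allow_stale=True)  # last resort: stale data, TTL 1
```

   [ In this sketch an upstream result with any other response code,
   such as SERVFAIL, is treated the same as no result at all, so the
   expired record is still served. Passing the clock in as a parameter
   is purely a convenience for exercising expiry without waiting. ]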