idnits 2.17.1 

draft-tale-dnsop-serve-stale-01.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  -- The draft header indicates that this document updates RFC1034, but the
     abstract doesn't seem to mention this, which it should.

  -- The draft header indicates that this document updates RFC1035, but the
     abstract doesn't seem to mention this, which it should.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

     (Using the creation date from RFC1034, updated by this document, for
     RFC5378 checks: 1987-11-01)

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (June 27, 2017) is 2487 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Obsolete informational reference (is this intentional?): RFC 7719
     (Obsoleted by RFC 8499)


     Summary: 0 errors (**), 0 flaws (~~), 1 warning (==), 5 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.
--------------------------------------------------------------------------------


2	DNSOP Working Group                                          D. Lawrence
3	Internet-Draft                                       Akamai Technologies
4	Updates: 1034, 1035 (if approved)                              W. Kumari
5	Intended status: Standards Track                                  Google
6	Expires: December 29, 2017                                 June 27, 2017

8	              Serving Stale Data to Improve DNS Resiliency
9	                    draft-tale-dnsop-serve-stale-01

11	Abstract

13	   This draft defines a method for recursive resolvers to use stale DNS
14	   data to avoid outages when authoritative nameservers cannot be
15	   reached to refresh expired data.

17	Ed note

19	   Text inside square brackets ([]) is additional background
20	   information, answers to frequently asked questions, general musings,
21	   etc.  It will be removed before publication.  This document is
22	   being collaborated on in GitHub at <https://github.com/vttale/serve-
23	   stale>.  The most recent version of the document, open issues, etc.
24	   should all be available there.  The authors gratefully accept pull
25	   requests.

27	Status of This Memo

29	   This Internet-Draft is submitted in full conformance with the
30	   provisions of BCP 78 and BCP 79.

32	   Internet-Drafts are working documents of the Internet Engineering
33	   Task Force (IETF).  Note that other groups may also distribute
34	   working documents as Internet-Drafts.  The list of current Internet-
35	   Drafts is at http://datatracker.ietf.org/drafts/current/.

37	   Internet-Drafts are draft documents valid for a maximum of six months
38	   and may be updated, replaced, or obsoleted by other documents at any
39	   time.  It is inappropriate to use Internet-Drafts as reference
40	   material or to cite them other than as "work in progress."

42	   This Internet-Draft will expire on December 29, 2017.

44	Copyright Notice

46	   Copyright (c) 2017 IETF Trust and the persons identified as the
47	   document authors.  All rights reserved.

49	   This document is subject to BCP 78 and the IETF Trust's Legal
50	   Provisions Relating to IETF Documents
51	   (http://trustee.ietf.org/license-info) in effect on the date of
52	   publication of this document.  Please review these documents
53	   carefully, as they describe your rights and restrictions with respect
54	   to this document.  Code Components extracted from this document must
55	   include Simplified BSD License text as described in Section 4.e of
56	   the Trust Legal Provisions and are provided without warranty as
57	   described in the Simplified BSD License.

59	Table of Contents

61	   1.  Introduction
62	   2.  Terminology
63	   3.  Description
64	   4.  Implementation Caveats
65	   5.  Implementation Status
66	   6.  Security Considerations
67	   7.  Privacy Considerations
68	   8.  NAT Considerations
69	   9.  IANA Considerations
70	   10. Acknowledgements
71	   11. References
72	     11.1.  Normative References
73	     11.2.  Informative References
74	   Authors' Addresses

76	1.  Introduction

78	   Traditionally the Time To Live (TTL) of a DNS resource record has
79	   been understood to represent the maximum number of seconds that a
80	   record can be used before it must be discarded, based on its
81	   description and usage in [RFC1035] and clarifications in [RFC2181].
82	   Specifically, [RFC1035] Section 3.2.1 says that it "specifies the
83	   time interval that the resource record may be cached before the
84	   source of the information should again be consulted".

86	   Notably, the original DNS specification does not say that data past
87	   its expiration cannot be used.  This document proposes a method for
88	   how recursive resolvers should handle stale DNS data to balance the
89	   competing needs of resiliency and freshness.  It is predicated on the
90	   observation that authoritative server unavailability can cause
91	   outages even when the underlying data those servers would return is
92	   typically unchanged.

94	   There are a number of reasons why an authoritative server may become
95	   unreachable, including Denial of Service (DoS) attacks, network
96	   issues, and so on.  This document suggests that, if the recursive
97	   server is unable to contact the authoritative server but still has
98	   data for the query name, it essentially extends the TTL of the
99	   existing data on the assumption that "stale bread is better than no
100	   bread".

102	   Several major recursive resolver operations currently use stale data
103	   for answers in some way, including Akamai, OpenDNS, and Xerocole.

105	2.  Terminology

107	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
108	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
109	   document are to be interpreted as described in [RFC2119].

111	   For a comprehensive treatment of DNS terms, please see [RFC7719].

113	3.  Description

115	   Three notable timers drive considerations for the use of stale data,
116	   as follows:

118	   o  A client response timer, which is the maximum amount of time a
119	      recursive resolver should allow between the receipt of a
120	      resolution request and sending its response.

122	   o  A query resolution timer, which caps the total amount of time a
123	      recursive resolver spends processing the query.

125	   o  A maximum stale timer, which caps the amount of time that records
126	      will be kept past their expiration.

128	   Recursive resolvers already have the second timer; the first and
129	   third timers are new concepts for this mechanism.

131	   When a request is received by the recursive resolver, it SHOULD start
132	   the client response timer.  This timer is used to avoid client
133	   timeouts.  It SHOULD be configurable, with a recommended value of 1.8
134	   seconds.

136	   The resolver then checks its cache for an unexpired answer.  If it
137	   finds none and the Recursion Desired flag is not set in the request,
138	   it SHOULD immediately return the response without consulting the
139	   cache for expired records.

141	   If iterative lookups will be done, it SHOULD start the query
142	   resolution timer.  This timer bounds the work done by the resolver,
143	   and is commonly around 10 to 30 seconds. [ BIND 9 used to use a hard-
144	   coded constant of 30 seconds and has more recently added a
145	   configuration parameter that defaults to 10 seconds and is capped at
146	   30.  A rigorous exploration of other implementations has not yet been
147	   done. ]

149	   If the answer has not been completely determined by the time the
150	   client response timer has elapsed, the resolver SHOULD then check its
151	   cache to see whether there is expired data that would satisfy the
152	   request.  If so, it adds that data to the response message and SHOULD
153	   set the TTL of each expired record in the message to 1 second.  The
154	   response is then sent to the client while the resolver continues its
155	   attempt to refresh the data.  1 second was chosen because
156	   historically 0 second TTLs have been problematic for some
157	   implementations.  It not only sidesteps those potential problems with
158	   no practical negative consequence, it would also rate limit further
159	   queries from any client that is honoring the TTL, such as a
160	   forwarding resolver.
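
	   The flow described above can be sketched roughly as follows.  This
	   is an illustrative sketch only, not the BIND patch: the function
	   name answer_query, the resolve and stale_lookup callables, and the
	   simplified (rdata, ttl) record shape are all assumptions made for
	   this example.  It shows the client response timer firing first, the
	   stale answer being sent with a TTL of 1 second, and the refresh
	   attempt continuing until the query resolution timer.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional, Tuple

Record = Tuple[str, int]  # (rdata, ttl in seconds); simplified for illustration

CLIENT_RESPONSE_TIMER = 1.8  # seconds, the recommended value in the text
STALE_ANSWER_TTL = 1         # seconds, TTL set on expired records in a response


async def answer_query(
    resolve: Callable[[], Awaitable[List[Record]]],   # hypothetical iterative lookup
    stale_lookup: Callable[[], Optional[List[Record]]],  # hypothetical expired-cache lookup
    client_timer: float = CLIENT_RESPONSE_TIMER,
    query_timer: float = 10.0,
) -> Tuple[List[Record], Optional[asyncio.Future]]:
    """Return (answer, refresh task still running in the background or None)."""
    # The query resolution timer bounds the total work done for this query.
    refresh = asyncio.ensure_future(asyncio.wait_for(resolve(), query_timer))

    # Wait for resolution, but only as long as the client response timer.
    # asyncio.wait() does not cancel the task when its timeout elapses.
    done, _pending = await asyncio.wait({refresh}, timeout=client_timer)
    if done:
        # Resolution finished in time; result() re-raises any failure.
        # Handling an early failure is left out of this sketch.
        return refresh.result(), None

    stale = stale_lookup()
    if stale is None:
        # No expired data to offer, so keep waiting for the resolution.
        return await refresh, None

    # Serve the expired records with a 1 second TTL while the refresh
    # attempt continues in the background.
    return [(rdata, STALE_ANSWER_TTL) for rdata, _ttl in stale], refresh
```

	   A caller would send the returned answer to the client immediately
	   and, when a refresh task is returned, let it run to completion so
	   that the cache is updated if the authoritative servers eventually
	   respond.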

162	   The maximum stale timer is used for cache management and is
163	   independent of the query resolution process.  This timer is
164	   conceptually different from the maximum cache TTL that exists in many
165	   resolvers, the latter being a clamp on the value of TTLs as received
166	   from authoritative servers.  The maximum stale timer SHOULD be
167	   configurable, and defines the length of time after a record expires
168	   that it SHOULD be retained in the cache.  The suggested value is 7
169	   days, which gives time for the resolution problem to be noticed and
170	   for human intervention to fix it.
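
	   The difference between the maximum cache TTL and the maximum stale
	   timer can be illustrated with a small sketch.  The CacheEntry class,
	   its field names, and the 1 day MAX_CACHE_TTL value are hypothetical
	   and chosen only for this example; the 7 day maximum stale value is
	   the one suggested above.

```python
from dataclasses import dataclass

MAX_STALE = 7 * 24 * 60 * 60   # seconds; the suggested value of 7 days
MAX_CACHE_TTL = 24 * 60 * 60   # seconds; hypothetical clamp on received TTLs


@dataclass
class CacheEntry:
    stored_at: float  # time the record was cached, in seconds
    ttl: int          # TTL as received from the authoritative server

    def expires_at(self) -> float:
        # The maximum cache TTL clamps the TTL that was received.
        return self.stored_at + min(self.ttl, MAX_CACHE_TTL)

    def is_fresh(self, now: float) -> bool:
        return now < self.expires_at()

    def is_retained_stale(self, now: float) -> bool:
        # The maximum stale timer starts only once the record has expired.
        expiry = self.expires_at()
        return expiry <= now < expiry + MAX_STALE


entry = CacheEntry(stored_at=0.0, ttl=300)
print(entry.is_fresh(299.0), entry.is_retained_stale(300.0))
```

	   In this sketch a record is usable as a normal answer until it
	   expires, usable only as a stale answer for the following 7 days, and
	   eligible for removal after that.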

172	   This same basic technique MAY be used to handle stale data associated
173	   with delegations.  If authoritative server addresses cannot be
174	   refreshed, resolution can still succeed if the authoritative servers
175	   themselves are still up.

177	4.  Implementation Caveats

179	   Answers from authoritative servers that have a DNS Response Code of
180	   either 0 (NOERROR) or 3 (NXDOMAIN) MUST be considered to have
181	   refreshed the data at the resolver.  In particular, this means that
182	   this method is not meant to protect against operator error at the
183	   authoritative server that turns a name that is intended to be valid
184	   into one that is non-existent, because there is no way for a resolver
185	   to know intent.
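
	   As a minimal sketch of the rule above, a resolver could classify
	   authoritative responses as follows.  The function name and the use
	   of None to represent "no response received" are assumptions made
	   for this example; the Response Code values are the standard ones.

```python
from typing import Optional

NOERROR = 0
SERVFAIL = 2
NXDOMAIN = 3


def refreshes_cached_data(rcode: Optional[int]) -> bool:
    """Return True if a response with this rcode counts as a refresh.

    rcode is None when no response was received at all, for example
    because every authoritative server timed out.
    """
    return rcode in (NOERROR, NXDOMAIN)


# An authoritative NXDOMAIN replaces whatever was cached, even if the
# operator removed the name by mistake.
print(refreshes_cached_data(NXDOMAIN), refreshes_cached_data(None))
```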

187	   [ Paul Vixie has suggested that it be made explicit that an auth
188	   NXDOMAIN cause all data, even stale data, below the NXDOMAIN to also
189	   be removed, a la https://datatracker.ietf.org/doc/draft-vixie-dnsext-
190	   resimprove/.  Conceptually this certainly has its appeal but
191	   addressing it in this document when resimprove has not progressed has
192	   procedural problems.  This paragraph will be removed in the next
193	   draft, either dropping the idea here completely or blessing it based
194	   on positive feedback to do so. ]

196	   Resolution is given a chance to succeed before stale data is used, in
197	   order to adhere to the original intent of the design of the DNS.
198	   This mechanism is intended only to add robustness against failures,
199	   and to be safe to leave enabled all the time.  If stale data were
200	   used immediately and a cache refresh attempted only after the client
201	   response had been sent, the resolver would frequently be sending
202	   stale data that it would have had no trouble refreshing.

204	   It is important to continue the resolution attempt after the stale
205	   response has been sent, until the query resolution timeout, because
206	   some pathological resolutions can take many seconds to succeed as
207	   they cope with unavailable servers, bad networks, and other problems.
208	   Stopping the resolution attempt when the response with expired data
209	   has been sent would mean that answers in these pathological cases
210	   would never be refreshed.

212	   Canonical Name (CNAME) records mingled in the expired cache with
213	   other records at the same owner name can cause surprising results.
214	   This was observed with an initial implementation in BIND, where a
215	   hostname changed from having an IPv4 Address (A) record to a CNAME.
216	   The version of BIND being used did not evict other types in the cache
217	   when a CNAME was received, which in normal operations is not a
218	   significant issue.  However, after both records expired and the
219	   authorities became unavailable, the fallback to stale answers
220	   returned the older A instead of the newer CNAME.
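
	   One way to avoid the situation described above is to evict other
	   types at the same owner name when a CNAME is cached.  The sketch
	   below is illustrative only and is not the BIND fix; the cache
	   layout and the store_rrset name are hypothetical.  It assumes
	   owner names are already normalized, and it ignores DNSSEC-related
	   types that may legitimately coexist with a CNAME.

```python
from typing import Dict, List, Tuple

# (owner name, record type) -> list of rdata strings
Cache = Dict[Tuple[str, str], List[str]]


def store_rrset(cache: Cache, name: str, rrtype: str, rdata: List[str]) -> None:
    """Cache an RRset; a new CNAME evicts other types at the same name."""
    if rrtype == "CNAME":
        occluded = [key for key in cache if key[0] == name and key[1] != "CNAME"]
        for key in occluded:
            del cache[key]
    cache[(name, rrtype)] = list(rdata)


cache: Cache = {}
store_rrset(cache, "www.example.", "A", ["192.0.2.1"])
store_rrset(cache, "www.example.", "CNAME", ["host.example.net."])
print(sorted(cache))  # only the CNAME remains for www.example.
```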

222	   [ This probably applies to other occluding types, so more thought
223	   should be given to the overall issue. ]

225	   Keeping records around after their normal expiration will of course
226	   cause caches to grow larger than if records were removed at their
227	   TTL.  Specific guidance on managing cache sizes is outside the scope
228	   of this document.  Some areas for consideration include whether to
229	   track the popularity of names in client requests versus evicting by
230	   maximum age, and whether to provide a feature for manually flushing
231	   only stale records.
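
	   A feature for manually flushing only stale records could look
	   something like the following sketch.  The flush_stale name and the
	   mapping of cache keys to expiration times are assumptions made for
	   this example and do not describe any existing implementation.

```python
from typing import Dict, Tuple

Key = Tuple[str, str]  # (owner name, record type)


def flush_stale(expiry_by_key: Dict[Key, float], now: float) -> int:
    """Remove entries whose TTL has expired; return how many were removed."""
    stale = [key for key, expires_at in expiry_by_key.items() if expires_at <= now]
    for key in stale:
        del expiry_by_key[key]
    return len(stale)


expiries = {("a.example.", "A"): 100.0, ("b.example.", "A"): 300.0}
removed = flush_stale(expiries, now=200.0)
print(removed, list(expiries))
```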

233	5.  Implementation Status

235	   [RFC Editor: per RFC 6982 this section should be removed prior to
236	   publication.]

238	   The algorithm described in this draft was originally implemented as a
239	   patch to BIND 9.7.0.  It has been in use on Akamai's production
240	   network since 2011, and has effectively smoothed over transient
241	   failures and longer outages that would otherwise have resulted in
242	   major incidents.  The patch has been contributed to the Internet
243	   Systems Consortium in anticipation that it will be incorporated into
244	   their main BIND distribution.

246	   Unbound has a similar feature for serving stale answers, but it works
247	   in a very different way by returning whatever cached answer it has
248	   before trying to refresh expired records.

250	6.  Security Considerations

252	   The most obvious security issue is the increased likelihood of DNSSEC
253	   validation failures when using stale data because signatures could be
254	   returned outside their validity period.  This would only be an issue
255	   if the authoritative servers are unreachable, the only time the
256	   techniques in this document are used, and thus does not introduce a
257	   new failure in place of what would have otherwise been success.

259	   Additionally, bad actors have been known to use DNS caches to keep
260	   records alive even after their authorities have gone away.  The
261	   mechanism described here makes that easier.

263	7.  Privacy Considerations

265	   This document does not add any practical new privacy issues.

267	8.  NAT Considerations

269	   The method described here is not affected by the use of NAT devices.

271	9.  IANA Considerations

273	   This document contains no actions for IANA.  This section will be
274	   removed during conversion into an RFC by the RFC editor.

276	10.  Acknowledgements

278	   The authors wish to thank Matti Klock, Mukund Sivaraman, Jean Roy,
279	   and Jason Moreau for initial review.

281	11.  References

283	11.1.  Normative References

285	   [RFC1035]  Mockapetris, P., "Domain names - implementation and
286	              specification", STD 13, RFC 1035, DOI 10.17487/RFC1035,
287	              November 1987, <http://www.rfc-editor.org/info/rfc1035>.

289	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
290	              Requirement Levels", BCP 14, RFC 2119,
291	              DOI 10.17487/RFC2119, March 1997,
292	              <http://www.rfc-editor.org/info/rfc2119>.

294	   [RFC2181]  Elz, R. and R. Bush, "Clarifications to the DNS
295	              Specification", RFC 2181, DOI 10.17487/RFC2181, July 1997,
296	              <http://www.rfc-editor.org/info/rfc2181>.

298	11.2.  Informative References

300	   [RFC7719]  Hoffman, P., Sullivan, A., and K. Fujiwara, "DNS
301	              Terminology", RFC 7719, DOI 10.17487/RFC7719, December
302	              2015, <http://www.rfc-editor.org/info/rfc7719>.

304	Authors' Addresses

306	   David C Lawrence
307	   Akamai Technologies
308	   150 Broadway
309	   Cambridge  MA 02142-1054
310	   USA

312	   Email: tale@akamai.com

314	   Warren Kumari
315	   Google
316	   1600 Amphitheatre Parkway
317	   Mountain View  CA 94043
318	   USA

320	   Email: warren@kumari.net