idnits 2.17.1 

draft-koster-rep-04.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  -- The document has an IETF Trust Provisions (28 Dec 2009) Section 6.c(i)
     Publication Limitation clause.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (December 08, 2020) is 1234 days in the past.  Is this
     intentional?

  -- Found something which looks like a code comment -- if you have code
     sections in the document, please surround them with '<CODE BEGINS>' and
     '<CODE ENDS>' lines.


  Checking references for intended status: Informational
  ----------------------------------------------------------------------------

  -- Looks like a reference, but probably isn't: '1' on line 424

  -- Looks like a reference, but probably isn't: '2' on line 426

  -- Looks like a reference, but probably isn't: '3' on line 428

  -- Looks like a reference, but probably isn't: '4' on line 430

  -- Looks like a reference, but probably isn't: '5' on line 432

  -- Looks like a reference, but probably isn't: '6' on line 434

  -- Looks like a reference, but probably isn't: '7' on line 436

  -- Looks like a reference, but probably isn't: '8' on line 438

  -- Duplicate reference: RFC2119, mentioned in 'RFC8174', was also mentioned
     in 'RFC2119'.


     Summary: 0 errors (**), 0 flaws (~~), 1 warning (==), 12 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.
--------------------------------------------------------------------------------

1	Network Working Group                                          M. Koster
2	Internet-Draft                                Stalworthy Computing, Ltd.
3	Intended status: Informational                                 G. Illyes
4	Expires: June 5, 2021                                          H. Zeller
5	                                                               L. Harvey
6	                                                                  Google
7	                                                       December 08, 2020

9	                       Robots Exclusion Protocol
10	                         draft-koster-rep-04

12	Abstract

14	   This document standardizes and extends the "Robots Exclusion
15	   Protocol" <http://www.robotstxt.org/> method originally defined by
16	   Martijn Koster in 1996 for service owners to control how content
17	   served by their services may be accessed, if at all, by automatic
18	   clients known as crawlers.

20	Status of This Memo

22	   This Internet-Draft is submitted in full conformance with the
23	   provisions of BCP 78 and BCP 79.

25	   Internet-Drafts are working documents of the Internet Engineering
26	   Task Force (IETF).  Note that other groups may also distribute
27	   working documents as Internet-Drafts.  The list of current Internet-
28	   Drafts is at http://datatracker.ietf.org/drafts/current/.

30	   Internet-Drafts are draft documents valid for a maximum of six months
31	   and may be updated, replaced, or obsoleted by other documents at any
32	   time.  It is inappropriate to use Internet-Drafts as reference
33	   material or to cite them other than as "work in progress."

35	   This document may not be modified, and derivative works of it may not
36	   be created, except to format it for publication as an RFC or to
37	   translate it into languages other than English.

39	   This Internet-Draft will expire on June 5, 2021.

41	Copyright Notice

43	   Copyright (c) 2020 IETF Trust and the persons identified as the
44	   document authors.  All rights reserved.

46	   This document is subject to BCP 78 and the IETF Trust's Legal
47	   Provisions Relating to IETF Documents
48	   (http://trustee.ietf.org/license-info) in effect on the date of
49	   publication of this document.  Please review these documents
50	   carefully, as they describe your rights and restrictions with respect
51	   to this document.  Code Components extracted from this document must
52	   include Simplified BSD License text as described in Section 4.e of
53	   the Trust Legal Provisions and are provided without warranty as
54	   described in the Simplified BSD License.

56	Table of Contents

58	   1.  Introduction
59	     1.1.  Terminology
60	   2.  Specification
61	     2.1.  Protocol definition
62	     2.2.  Formal syntax
63	       2.2.1.  The user-agent line
64	       2.2.2.  The Allow and Disallow lines
65	       2.2.3.  Special characters
66	       2.2.4.  Other records
67	     2.3.  Access method
68	       2.3.1.  Access results
69	     2.4.  Caching
70	     2.5.  Limits
71	     2.6.  Security Considerations
72	     2.7.  IANA Considerations
73	   3.  Examples
74	     3.1.  Simple example
75	     3.2.  Longest Match
76	   4.  References
77	     4.1.  Normative References
78	     4.2.  URIs
79	   Authors' Addresses

81	1.  Introduction

83	   This document applies to services that provide resources that clients
84	   can access through URIs as defined in RFC3986 [1].  For example, in
85	   the context of HTTP, a browser is a client that displays the content
86	   of a web page.

88	   Crawlers are automated clients.  Search engines, for instance, have
89	   crawlers to recursively traverse links for indexing as defined in
90	   RFC8288 [2].

92	   It may be inconvenient for service owners if crawlers visit the
93	   entirety of their URI space.  This document specifies the rules that
94	   crawlers MUST obey when accessing URIs.

96	   These rules are not a form of access authorization.

98	1.1.  Terminology

100	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
101	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
102	   "OPTIONAL" in this document are to be interpreted as described in
103	   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
104	   capitals, as shown here.

106	2.  Specification

108	2.1.  Protocol definition

110	   The protocol language consists of rule(s) and group(s):

112	   o  *Rule*: A line with a key-value pair that defines how a crawler
113	      may access URIs.  See Section 2.2.2.

115	   o  *Group*: One or more user-agent lines that are followed by one or
116	      more rules.  The group is terminated by a user-agent line or end
117	      of file.  See Section 2.2.1.  The last group may have no rules,
118	      which means it implicitly allows everything.

120	2.2.  Formal syntax

122	   Below is an Augmented Backus-Naur Form (ABNF) description, as
123	   described in RFC5234 [3].

125	   robotstxt = *(group / emptyline)
126	   group = startgroupline                ; We start with a user-agent
127	           *(startgroupline / emptyline) ; ... and possibly more
128	                                         ; user-agents
129	           *(rule / emptyline)           ; followed by rules relevant
130	                                         ; for UAs

132	   startgroupline = *WS "user-agent" *WS ":" *WS product-token EOL

134	   rule = *WS ("allow" / "disallow") *WS ":"
135	          *WS (path-pattern / empty-pattern) EOL

137	   ; parser implementors: add additional lines you need (for
138	   ; example Sitemaps), and be lenient when reading lines that don't
139	   ; conform. Apply Postel's law.

141	   product-token = identifier / "*"
142	   path-pattern = "/" *UTF8-char-noctl ; valid URI path pattern
143	   empty-pattern = *WS

145	   identifier = 1*(%x2D / %x41-5A / %x5F / %x61-7A)
146	   comment = "#" *(UTF8-char-noctl / WS / "#")
147	   emptyline = EOL
148	   EOL = *WS [comment] NL ; end-of-line may have
149	                          ; optional trailing comment
150	   NL = %x0D / %x0A / %x0D.0A
151	   WS = %x20 / %x09
152	   ; UTF8 derived from RFC3629, but excluding control characters

154	   UTF8-char-noctl = UTF8-1-noctl / UTF8-2 / UTF8-3 / UTF8-4
155	   UTF8-1-noctl = %x21 / %x22 / %x24-7F ; excluding control, space, '#'
156	   UTF8-2 = %xC2-DF UTF8-tail
157	   UTF8-3 = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2UTF8-tail /
158	            %xED %x80-9F UTF8-tail / %xEE-EF 2UTF8-tail
159	   UTF8-4 = %xF0 %x90-BF 2UTF8-tail / %xF1-F3 3UTF8-tail /
160	            %xF4 %x80-8F 2UTF8-tail

162	   UTF8-tail = %x80-BF
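
   As an illustration of the grammar above, the following Python sketch
   splits a single robots.txt line into a key and a value.  It is not
   part of the protocol: the name parse_line and the choice to return
   None for blank, comment-only, and unrecognized lines are assumptions
   made for this example.  It does not validate the product token or
   the path pattern against the ABNF.

```python
import re

# Hypothetical helper, not defined by the draft: key, optional whitespace,
# a colon, optional whitespace, then the value.
_LINE = re.compile(r"^[ \t]*([A-Za-z-]+)[ \t]*:[ \t]*(.*)$")

def parse_line(line):
    """Return (key, value) for a user-agent/allow/disallow line, else None."""
    # Drop a trailing comment; the grammar excludes "#" from path patterns.
    body = line.split("#", 1)[0].rstrip(" \t\r\n")
    match = _LINE.match(body)
    if match is None:
        return None  # blank line, comment-only line, or malformed line
    key = match.group(1).lower()
    value = match.group(2).strip(" \t")
    if key not in ("user-agent", "allow", "disallow"):
        return None  # lenient: skip records this sketch does not know
    return key, value
```

   For instance, parse_line("allow: / # comment in line") yields the
   pair ("allow", "/"), while a 'sitemap' line is skipped.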

164	2.2.1.  The user-agent line

166	   Crawlers set a product token to find relevant groups.  The product
167	   token MUST contain only "a-zA-Z_-" characters.  The product token
168	   SHOULD be part of the identification string that the crawler sends
169	   to the service (for example, in the case of HTTP, the product name
170	   SHOULD be in the user-agent header).  The identification string
171	   SHOULD describe the purpose of the crawler.  The following example
172	   shows an HTTP header with a link to a page describing the purpose of
173	   the ExampleBot crawler; the name appears both in the HTTP header and
174	   as the product token:

176	   +-------------------------------------------------+-----------------+
177	   | HTTP header                                     | robots.txt      |
178	   |                                                 | user-agent line |
179	   +-------------------------------------------------+-----------------+
180	   | user-agent: Mozilla/5.0 (compatible;            | user-agent:     |
181	   | ExampleBot/0.1;                                 | ExampleBot      |
182	   | https://www.example.com/bot.html)               |                 |
183	   +-------------------------------------------------+-----------------+

185	   Crawlers MUST find the group that matches the product token exactly,
186	   and then obey the rules of the group.  If there is more than one
187	   group matching the user-agent, the matching groups' rules MUST be
188	   combined into one group.  The matching MUST be case-insensitive.  If
189	   no matching group exists, crawlers MUST obey the first group with a
190	   user-agent line with a "*" value, if present.  If no group satisfies
191	   either condition, or no groups are present at all, no rules apply.
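
   The group selection described above can be sketched in Python as
   follows.  The function name select_rules and the data layout (a list
   of (user_agents, rules) pairs in file order) are assumptions made
   for this example, not something the protocol prescribes.

```python
def select_rules(groups, product_token):
    """groups: list of (user_agents, rules) pairs in file order.

    user_agents is a list of strings; rules is a list of (kind, path).
    Returns the rules that apply to the given product token.
    """
    token = product_token.lower()  # matching is case-insensitive
    matched = [rules for agents, rules in groups
               if any(agent.lower() == token for agent in agents)]
    if matched:
        # Several groups may name the same crawler; combine their rules.
        return [rule for rules in matched for rule in rules]
    for agents, rules in groups:
        if "*" in agents:
            return list(rules)  # only the first "*" group is used
    return []  # no group matches: no rules apply
```

   An empty result means that no rules apply, so every URI is allowed.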

193	2.2.2.  The Allow and Disallow lines

195	   These lines indicate whether accessing a URI that matches the
196	   corresponding path is allowed or disallowed.

198	   To evaluate if access to a URI is allowed, a robot MUST match the
199	   paths in allow and disallow rules against the URI.  The matching
200	   SHOULD be case sensitive.  The most specific match found MUST be
201	   used.  The most specific match is the match that has the most octets.
202	   If an allow and a disallow rule are equivalent, the allow SHOULD be
203	   used.  If no match is found amongst the rules in a group for a
204	   matching user-agent, or there are no rules in the group, the URI is
205	   allowed.  The /robots.txt URI is implicitly allowed.
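
   A minimal Python sketch of this evaluation follows.  It uses plain
   prefix matching only and leaves the special characters of
   Section 2.2.3 out of scope.  The name is_allowed is hypothetical,
   and skipping empty patterns is an assumption of this sketch, since
   the text above does not say how an empty pattern is evaluated.

```python
def is_allowed(rules, path):
    """rules: list of (kind, pattern); kind is "allow" or "disallow"."""
    if path == "/robots.txt":
        return True  # implicitly allowed
    best_len, best_kind = -1, "allow"  # no match at all means allowed
    for kind, pattern in rules:
        if not pattern or not path.startswith(pattern):
            continue  # assumption: an empty pattern matches nothing
        length = len(pattern.encode("utf-8"))  # specificity in octets
        # The longest pattern wins; on a tie the allow rule is preferred.
        if length > best_len or (length == best_len and kind == "allow"):
            best_len, best_kind = length, kind
    return best_kind == "allow"
```

   The rules argument is expected to hold only the rules of the group
   already selected for the crawler.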

207	   Octets in the URI and robots.txt paths outside the range of the US-
208	   ASCII coded character set, and those in the reserved range defined by
209	   RFC3986 [1], MUST be percent-encoded as defined by RFC3986 [1] prior
210	   to comparison.

212	   If a percent-encoded US-ASCII octet is encountered in the URI, it
213	   MUST be unencoded prior to comparison, unless it is a reserved
214	   character in the URI as defined by RFC3986 [1] or the character is
215	   outside the unreserved character range.  The match evaluates
216	   positively if and only if the end of the path from the rule is
217	   reached before a difference in octets is encountered.

219	   For example:

221	   +-------------------+-----------------------+-----------------------+
222	   | Path              | Encoded Path          | Path to match         |
223	   +-------------------+-----------------------+-----------------------+
224	   | /foo/bar?baz=quz  | /foo/bar?baz=quz      | /foo/bar?baz=quz      |
225	   |                   |                       |                       |
226	   | /foo/bar?baz=http | /foo/bar?baz=http%3A% | /foo/bar?baz=http%3A% |
227	   | ://foo.bar        | 2F%2Ffoo.bar          | 2F%2Ffoo.bar          |
228	   |                   |                       |                       |
229	   | /foo/bar/U+E38384 | /foo/bar/%E3%83%84    | /foo/bar/%E3%83%84    |
230	   |                   |                       |                       |
231	   | /foo/bar/%E3%83%8 | /foo/bar/%E3%83%84    | /foo/bar/%E3%83%84    |
232	   | 4                 |                       |                       |
233	   |                   |                       |                       |
234	   | /foo/bar/%62%61%7 | /foo/bar/%62%61%7A    | /foo/bar/baz          |
235	   | A                 |                       |                       |
236	   +-------------------+-----------------------+-----------------------+
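
   Part of the normalization shown in the table can be sketched in
   Python as follows.  The sketch covers two steps only: octets outside
   US-ASCII are percent-encoded, and percent-encoded unreserved
   characters are decoded.  It does not encode reserved characters
   inside a query string (the second row of the table).  The name
   normalize_path is hypothetical, and upper-casing the hex digits of
   escapes that are kept is an assumption of this sketch.

```python
import re

# Unreserved characters as listed in RFC 3986.
UNRESERVED = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
              "abcdefghijklmnopqrstuvwxyz0123456789-._~")

def normalize_path(path):
    # Step 1: percent-encode every octet outside US-ASCII.
    encoded = "".join(
        ch if ord(ch) < 0x80
        else "".join("%%%02X" % octet for octet in ch.encode("utf-8"))
        for ch in path)

    # Step 2: decode escapes of unreserved characters and keep all other
    # escapes, with their hex digits in upper case (an assumption).
    def fix(match):
        ch = chr(int(match.group(1), 16))
        return ch if ch in UNRESERVED else "%" + match.group(1).upper()

    return re.sub(r"%([0-9A-Fa-f]{2})", fix, encoded)
```

   With this sketch, "/foo/bar/%62%61%7A" becomes "/foo/bar/baz", and a
   path whose last segment is the character with UTF-8 octets E3 83 84
   becomes "/foo/bar/%E3%83%84".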

238	   The crawler SHOULD ignore "disallow" and "allow" rules that are not
239	   in any group (for example, any rule that precedes the first user-
240	   agent line).

242	   Implementers MAY bridge encoding mismatches if they detect that the
243	   robots.txt file is not UTF-8 encoded.

245	2.2.3.  Special characters

247	   Crawlers SHOULD allow the following special characters:

249	   +-----------+--------------------------------+----------------------+
250	   | Character | Description                    | Example              |
251	   +-----------+--------------------------------+----------------------+
252	   | "#"       | Designates an end of line      | "allow: / # comment  |
253	   |           | comment.                       | in line"             |
254	   |           |                                |                      |
255	   |           |                                | "# comment at the    |
256	   |           |                                | end"                 |
257	   |           |                                |                      |
258	   | "$"       | Designates the end of the      | "allow:              |
259	   |           | match pattern.  The URI MUST   | /this/path/exactly$" |
260	   |           | end where the pattern ends.    |                      |
261	   |           |                                |                      |
262	   | "*"       | Designates 0 or more instances | "allow:              |
263	   |           | of any character.              | /this/*/exactly"     |
264	   +-----------+--------------------------------+----------------------+

266	   If crawlers match special characters verbatim in the URI, crawlers
267	   SHOULD use "%" encoding.  For example:

269	   +------------------------+------------------------------------------+
270	   | Pattern                | URI                                      |
271	   +------------------------+------------------------------------------+
272	   | /path/file-            | https://www.example.com/path/file-       |
273	   | with-a-%2A.html        | with-a-*.html                            |
274	   |                        |                                          |
275	   | /path/foo-%24          | https://www.example.com/path/foo-$       |
276	   +------------------------+------------------------------------------+
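
   One possible way to evaluate the "*" and "$" special characters is
   to translate a pattern into a regular expression, as in the Python
   sketch below.  The names pattern_to_regex and matches are
   hypothetical.  Treating "$" as special only at the very end of the
   pattern is an assumption of this sketch.

```python
import re

def pattern_to_regex(pattern):
    """Translate a path pattern into a compiled regular expression."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal part, then let every "*" match any characters.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""), re.DOTALL)

def matches(pattern, path):
    # re.match anchors at the start, so an unanchored pattern is a prefix.
    return pattern_to_regex(pattern).match(path) is not None
```

   A percent-encoded "%2A" or "%24" in a pattern is matched literally
   by this sketch, because only the bare "*" and "$" characters are
   given a special meaning.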

278	2.2.4.  Other records

280	   Clients MAY interpret other records that are not part of the
281	   robots.txt protocol, for example, 'sitemap' [4].

283	2.3.  Access method

285	   The rules MUST be accessible in a file named "/robots.txt" (all lower
286	   case) in the top level path of the service.  The file MUST be UTF-8
287	   encoded (as defined in RFC3629 [5]) and Internet Media Type "text/
288	   plain" (as defined in RFC2046 [6]).

290	   As per RFC3986 [1], the URI of the robots.txt is:

292	   "scheme:[//authority]/robots.txt"

294	   For example, in the context of HTTP or FTP, the URI is:

296	       http://www.example.com/robots.txt
297	       https://www.example.com/robots.txt

299	       ftp://ftp.example.com/robots.txt
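
   As a sketch, the robots.txt URI for an arbitrary resource can be
   derived with Python's standard urllib.parse module.  The function
   name robots_txt_uri is an assumption made for this example.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_txt_uri(uri):
    """Return the robots.txt URI for the authority that serves `uri`."""
    parts = urlsplit(uri)
    # Keep scheme and authority; replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))
```

   The authority, including any port, is carried over unchanged.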

301	2.3.1.  Access results

303	2.3.1.1.  Successful access

305	   If the crawler successfully downloads the robots.txt, the crawler
306	   MUST follow the parseable rules.

308	2.3.1.2.  Redirects

310	   The server may respond to a robots.txt fetch request with a redirect,
311	   such as HTTP 301 and HTTP 302.  Crawlers SHOULD follow at least five
312	   consecutive redirects, even across authorities (for example, hosts
313	   in the case of HTTP), as defined in RFC1945 [7].

315	   If a robots.txt file is reached within five consecutive redirects,
316	   the robots.txt file MUST be fetched, parsed, and its rules followed
317	   in the context of the initial authority.

319	   If there are more than five consecutive redirects, crawlers MAY
320	   assume that the robots.txt is unavailable.
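
   The redirect handling above can be sketched in Python with an
   injected fetch function, so that the example needs no network
   access.  The names resolve_robots and fetch, the tuple layout, and
   the set of status codes treated as redirects beyond 301 and 302 are
   assumptions of this sketch.

```python
MAX_REDIRECTS = 5  # number of consecutive redirects this sketch follows

# Assumption: codes other than 301 and 302 are also treated as redirects.
REDIRECT_CODES = (301, 302, 303, 307, 308)

def resolve_robots(fetch, uri):
    """fetch(uri) returns (status, location_or_None, body_or_None).

    Returns (status, body) of the final response, or (None, None) when
    more than MAX_REDIRECTS consecutive redirects are encountered.
    """
    for _ in range(MAX_REDIRECTS + 1):  # first request plus five hops
        status, location, body = fetch(uri)
        if status not in REDIRECT_CODES or location is None:
            return status, body
        uri = location  # may point to a different authority
    return None, None  # too many redirects: treat as unavailable
```

   The rules obtained this way still apply to the authority of the
   initial request, not to the authority that finally served the file.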

322	2.3.1.3.  Unavailable status

324	   Unavailable means the crawler tries to fetch the robots.txt, and the
325	   server responds with unavailable status codes.  For example, in the
326	   context of HTTP, unavailable status codes are in the 400-499 range.

328	   If a server status code indicates that the robots.txt file is
329	   unavailable to the client, then crawlers MAY access any resources on
330	   the server or MAY use a cached version of a robots.txt file for up to
331	   24 hours.

333	2.3.1.4.  Unreachable status

335	   If the robots.txt is unreachable due to server or network errors,
336	   this means the robots.txt is undefined and the crawler MUST assume
337	   complete disallow.  For example, in the context of HTTP, an
338	   unreachable robots.txt has a response code in the 500-599 range.  For
339	   other undefined status codes, the crawler MUST assume the robots.txt
340	   is unreachable.

342	   If the robots.txt is undefined for a reasonably long period of time
343	   (for example, 30 days), clients MAY assume the robots.txt is
344	   unavailable or continue to use a cached copy.
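
   For HTTP, the access results above can be summarized by a small
   decision function, sketched here in Python.  The name access_policy
   and its return values are hypothetical.  Treating the 200-299 range
   as a successful download is an assumption of this sketch, and the
   30-day fallback is not modeled.  Redirects are assumed to have been
   followed already.

```python
def access_policy(status):
    """Map an HTTP status (or None for a network error) to a policy.

    Returns "parse", "allow_all", or "disallow_all".
    """
    if status is not None and 200 <= status <= 299:
        return "parse"          # follow the parseable rules
    if status is not None and 400 <= status <= 499:
        return "allow_all"      # unavailable: any resource may be accessed
    return "disallow_all"       # 5xx, network error, or any other status
```

   A crawler that holds a cached copy may prefer it over "allow_all"
   for up to 24 hours, as the text above permits.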

346	2.3.1.5.  Parsing errors

348	   Crawlers SHOULD try to parse each line of the robots.txt file.
349	   Crawlers MUST use the parseable rules.

351	2.4.  Caching

353	   Crawlers MAY cache the fetched robots.txt file's contents.  Crawlers
354	   MAY use standard cache control as defined in RFC2616 [8].  Crawlers
355	   SHOULD NOT use the cached version for more than 24 hours, unless the
356	   robots.txt is unreachable.
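
   A minimal Python sketch of the 24-hour freshness check follows.
   The name may_use_cache and the use of timestamps in seconds are
   assumptions made for this example.  Standard HTTP cache control is
   not modeled.

```python
CACHE_TTL_SECONDS = 24 * 60 * 60  # 24 hours

def may_use_cache(fetched_at, now, unreachable=False):
    """Return True if a cached robots.txt may still be used.

    fetched_at and now are timestamps in seconds.  An unreachable
    robots.txt lifts the 24-hour limit.
    """
    age = now - fetched_at
    return unreachable or age <= CACHE_TTL_SECONDS
```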

358	2.5.  Limits

360	   Crawlers MAY impose a parsing limit that MUST be at least 500
361	   kibibytes (KiB).
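
   As a sketch, a parser might enforce such a limit by truncating the
   fetched octets before parsing.  The names MIN_PARSE_LIMIT and
   truncate_for_parsing are hypothetical.

```python
MIN_PARSE_LIMIT = 500 * 1024  # 500 KiB expressed in octets

def truncate_for_parsing(data, limit=MIN_PARSE_LIMIT):
    """Return at most `limit` octets of a fetched robots.txt body."""
    if limit < MIN_PARSE_LIMIT:
        raise ValueError("parsing limit is below 500 KiB")
    return data[:limit]
```

   Truncation may cut the last line short; a lenient line parser would
   simply skip that incomplete line if it no longer conforms.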

363	2.6.  Security Considerations

365	   The Robots Exclusion Protocol MUST NOT be used as a form of security
366	   measure.  Listing URIs in the robots.txt file exposes them publicly
367	   and thus makes them discoverable.

369	2.7.  IANA Considerations

371	   This document has no actions for IANA.

373	3.  Examples

375	3.1.  Simple example

377	   The following example shows:

379	   o  *foobot*: A regular case.  A single user-agent token followed by
380	      rules.
381	   o  *barbot and bazbot*: A group that's relevant for more than one
382	      user-agent.
383	   o  *quxbot*: Empty group at end of file.

385	   <CODE BEGINS>
386	   User-Agent : foobot
387	   Disallow : /example/page.html
388	   Disallow : /example/disallowed.gif

390	   User-Agent : barbot
391	   User-Agent : bazbot
392	   Allow : /example/page.html
393	   Disallow : /example/disallowed.gif

395	   User-Agent: quxbot

397	   EOF
398	   <CODE ENDS>
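
   The grouping in the example above can be reproduced with the
   following Python sketch.  The name parse_groups and the returned
   data layout are assumptions made for this example.  The sketch is
   deliberately lenient: it skips any line without a colon and any
   record other than user-agent, allow, and disallow.

```python
def parse_groups(text):
    """Return a list of (user_agents, rules) pairs in file order."""
    groups = []
    collecting_agents = False  # True while reading consecutive user-agents
    for raw in text.splitlines():
        line = raw.split("#", 1)[0]  # drop a trailing comment
        if ":" not in line:
            continue  # blank or non-conforming line
        key, value = (part.strip() for part in line.split(":", 1))
        key = key.lower()
        if key == "user-agent":
            if not collecting_agents:
                groups.append(([], []))  # a new group starts here
                collecting_agents = True
            groups[-1][0].append(value)
        elif key in ("allow", "disallow"):
            collecting_agents = False
            if groups:  # rules that precede the first group are ignored
                groups[-1][1].append((key, value))
    return groups
```

   Applied to the example, this yields three groups: one for foobot
   with two disallow rules, one shared by barbot and bazbot, and one
   for quxbot with no rules.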

400	3.2.  Longest Match

402	   The following example shows that in the case of two rules, the
403	   longest one MUST be used for matching.  In the following case,
404	   /example/page/disallowed.gif MUST be used for the URI
405	   example.com/example/page/disallowed.gif .

407	   <CODE BEGINS>
408	   User-Agent : foobot
409	   Allow : /example/page/
410	   Disallow : /example/page/disallowed.gif
411	   <CODE ENDS>

413	4.  References

415	4.1.  Normative References

417	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
418	              Requirement Levels", BCP 14, RFC 2119, March 1997.
419	   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in
420	              RFC 2119 Key Words", BCP 14, RFC 8174, May 2017.

422	4.2.  URIs

424	   [1] https://tools.ietf.org/html/rfc3986

426	   [2] https://tools.ietf.org/html/rfc8288

428	   [3] https://tools.ietf.org/html/rfc5234

430	   [4] https://www.sitemaps.org/index.html

432	   [5] https://tools.ietf.org/html/rfc3629

434	   [6] https://tools.ietf.org/html/rfc2046

436	   [7] https://tools.ietf.org/html/rfc1945

438	   [8] https://tools.ietf.org/html/rfc2616

440	Authors' Addresses

442	   Martijn Koster
443	   Stalworthy Manor Farm
444	   Suton Lane, NR18 9JG
445	   Wymondham, Norfolk
446	   United Kingdom
447	   Email: m.koster@greenhills.co.uk

449	   Gary Illyes
450	   Brandschenkestrasse 110
451	   8002, Zurich
452	   Switzerland
453	   Email: garyillyes@google.com

455	   Henner Zeller
456	   1600 Amphitheatre Pkwy
457	   Mountain View, CA 94043
458	   USA
459	   Email: henner@google.com

461	   Lizzi Harvey
462	   1600 Amphitheatre Pkwy
463	   Mountain View, CA 94043
464	   USA
465	   Email: lizzi@google.com