Network Working Group                                     M. Koster, Ed.
Internet-Draft                                Stalworthy Computing, Ltd.
Intended status: Informational                            G. Illyes, Ed.
Expires: 6 November 2022                                  H. Zeller, Ed.
                                                         L. Sassman, Ed.
                                                             Google LLC.
                                                              5 May 2022


                       Robots Exclusion Protocol
                          draft-koster-rep-07

Abstract

   This document specifies and extends the "Robots Exclusion Protocol"
   method originally defined by Martijn Koster in 1996 for service
   owners to control how content served by their services may be
   accessed, if at all, by automatic clients known as crawlers.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 6 November 2022.

Copyright Notice

   Copyright (c) 2022 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents (https://trustee.ietf.org/
   license-info) in effect on the date of publication of this document.
   Please review these documents carefully, as they describe your
   rights and restrictions with respect to this document.  Code
   Components extracted from this document must include Revised BSD
   License text as described in Section 4.e of the Trust Legal
   Provisions and are provided without warranty as described in the
   Revised BSD License.

Table of Contents

   1.  Introduction
     1.1.  Requirements Language
   2.  Specification
     2.1.  Protocol Definition
     2.2.  Formal Syntax
       2.2.1.  The User-Agent Line
       2.2.2.  The Allow and Disallow Lines
       2.2.3.  Special Characters
       2.2.4.  Other Records
     2.3.  Access Method
       2.3.1.  Access Results
         2.3.1.1.  Successful Access
         2.3.1.2.  Redirects
         2.3.1.3.  Unavailable Status
         2.3.1.4.  Unreachable Status
         2.3.1.5.  Parsing Errors
     2.4.  Caching
     2.5.  Limits
   3.  Security Considerations
   4.  IANA Considerations
   5.  Examples
     5.1.  Simple Example
     5.2.  Longest Match
   6.  References
     6.1.  Normative References
     6.2.  Informative References
   Authors' Addresses

1.  Introduction

   This document applies to services that provide resources that
   clients can access through URIs as defined in [RFC3986].  For
   example, in the context of HTTP, a browser is a client that displays
   the content of a web page.

   Crawlers are automated clients.  Search engines, for instance, have
   crawlers that recursively traverse links for indexing, as defined in
   [RFC8288].

   It may be inconvenient for service owners if crawlers visit the
   entirety of their URI space.  This document specifies the rules
   originally defined by the "Robots Exclusion Protocol" [ROBOTSTXT]
   that crawlers are expected to obey when accessing URIs.

   These rules are not a form of access authorization.

1.1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

2.  Specification

2.1.  Protocol Definition

   The protocol language consists of rule(s) and group(s) that the
   service makes available in a file named 'robots.txt' as described in
   Section 2.3, as shown in the example below:

   *  Rule: A line with a key-value pair that defines how a crawler may
      access URIs.  See Section 2.2.2.

   *  Group: One or more user-agent lines that are followed by one or
      more rules.  The group is terminated by a user-agent line or end
      of file.  See Section 2.2.1.  The last group may have no rules,
      which means it implicitly allows everything.

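   For instance, the following illustrative file (the product token
   "foobot" and the paths are placeholders) contains two groups: one
   group of two rules for crawlers identifying as "foobot", and one
   group of a single rule for every other crawler:

   User-Agent: foobot
   Disallow: /example/page.html
   Disallow: /example/disallowed.gif

   User-Agent: *
   Allow: /
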
2.2.  Formal Syntax

   Below is an Augmented Backus-Naur Form (ABNF) description, as
   described in [RFC5234].

    robotstxt = *(group / emptyline)
    group = startgroupline                ; We start with a user-agent
            *(startgroupline / emptyline) ; ... and possibly more
                                          ; user-agents
            *(rule / emptyline)           ; followed by rules relevant
                                          ; for UAs

    startgroupline = *WS "user-agent" *WS ":" *WS product-token EOL

    rule = *WS ("allow" / "disallow") *WS ":"
           *WS (path-pattern / empty-pattern) EOL

    ; parser implementors: add additional lines you need (for
    ; example, sitemaps), and be lenient when reading lines that don't
    ; conform.  Apply Postel's law.

    product-token = identifier / "*"
    path-pattern = "/" *UTF8-char-noctl ; valid URI path pattern
    empty-pattern = *WS

    identifier = 1*(%x2D / %x41-5A / %x5F / %x61-7A)
    comment = "#" *(UTF8-char-noctl / WS / "#")
    emptyline = EOL
    EOL = *WS [comment] NL ; end-of-line may have
                           ; optional trailing comment
    NL = %x0D / %x0A / %x0D.0A
    WS = %x20 / %x09

    ; UTF8 derived from RFC3629, but excluding control characters

    UTF8-char-noctl = UTF8-1-noctl / UTF8-2 / UTF8-3 / UTF8-4
    UTF8-1-noctl = %x21 / %x22 / %x24-7F ; excluding control, space, '#'
    UTF8-2 = %xC2-DF UTF8-tail
    UTF8-3 = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2UTF8-tail /
             %xED %x80-9F UTF8-tail / %xEE-EF 2UTF8-tail
    UTF8-4 = %xF0 %x90-BF 2UTF8-tail / %xF1-F3 3UTF8-tail /
             %xF4 %x80-8F 2UTF8-tail

    UTF8-tail = %x80-BF

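   The grammar above maps naturally onto a small, lenient line parser.
   The following non-normative Python sketch (the function and type
   names are this sketch's own, not defined by the protocol) collects
   user-agent lines and their allow/disallow rules into groups,
   skipping lines that do not conform, per the note on Postel's law:

   import re
   from typing import List, Optional, Tuple

   # A rule is ("allow" | "disallow", path pattern); a group pairs the
   # product tokens of its user-agent lines with its rules.
   Rule = Tuple[str, str]
   Group = Tuple[List[str], List[Rule]]

   LINE = re.compile(r"^\s*(user-agent|allow|disallow)\s*:\s*([^#]*)",
                     re.I)

   def parse_robotstxt(text: str) -> List[Group]:
       groups: List[Group] = []
       current: Optional[Group] = None
       in_user_agents = False
       for line in text.splitlines():
           match = LINE.match(line)
           if match is None:
               continue  # lenient: skip empty or non-conforming lines
           key, value = match.group(1).lower(), match.group(2).strip()
           if key == "user-agent":
               if not in_user_agents:
                   current = ([], [])  # a user-agent line opens a group
                   groups.append(current)
               current[0].append(value.lower())
               in_user_agents = True
           else:
               in_user_agents = False
               if current is not None:  # ignore rules outside any group
                   current[1].append((key, value))
       return groups

   For example, parse_robotstxt("user-agent: foobot\ndisallow:
   /example/\n") yields [(["foobot"], [("disallow", "/example/")])].
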
2.2.1.  The User-Agent Line

   Crawlers set a product token to find relevant groups.  The product
   token MUST contain only "a-zA-Z_-" characters.  The product token
   SHOULD be part of the identification string that the crawler sends
   to the service (for example, in the case of HTTP, the product name
   SHOULD be in the user-agent header).  The identification string
   SHOULD describe the purpose of the crawler.  Here's an example of an
   HTTP header with a link pointing to a page describing the purpose of
   the ExampleBot crawler, which appears both in the HTTP header and as
   a product token:

   +===================================+=================+
   | HTTP header                       | robots.txt      |
   |                                   | user-agent line |
   +===================================+=================+
   | user-agent: Mozilla/5.0           | user-agent:     |
   | (compatible; ExampleBot/0.1;      | ExampleBot      |
   | https://www.example.com/bot.html) |                 |
   +-----------------------------------+-----------------+

        Table 1: Example of a user-agent header and user-
             agent robots.txt token for ExampleBot

   Crawlers MUST find the group that matches the product token exactly,
   and then obey the rules of the group.  If there is more than one
   group matching the user-agent, the matching groups' rules MUST be
   combined into one group.  The matching MUST be case-insensitive.  If
   no matching group exists, crawlers MUST obey the first group with a
   user-agent line with a "*" value, if present.  If no group satisfies
   either condition, or no groups are present at all, no rules apply.

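   This selection logic can be sketched as follows (non-normative; it
   assumes the Group representation of the previous sketch, with
   product tokens stored lowercased):

   from typing import List, Tuple

   Rule = Tuple[str, str]
   Group = Tuple[List[str], List[Rule]]  # (product tokens, rules)

   def select_rules(groups: List[Group],
                    product_token: str) -> List[Rule]:
       token = product_token.lower()  # matching is case-insensitive
       matching = [g for g in groups if token in g[0]]
       if matching:
           # Combine the rules of all exactly matching groups into one.
           return [rule for _, rules in matching for rule in rules]
       for agents, rules in groups:
           if "*" in agents:  # else the first group with "*", if any
               return rules
       return []  # no group applies, so no rules apply
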
2.2.2.  The Allow and Disallow Lines

   These lines indicate whether accessing a URI that matches the
   corresponding path is allowed or disallowed.

   To evaluate if access to a URI is allowed, a robot MUST match the
   paths in allow and disallow rules against the URI.  The matching
   SHOULD be case-sensitive.  The most specific match found MUST be
   used.  The most specific match is the match that has the most
   octets.  If an allow rule and a disallow rule are equivalent, the
   allow rule SHOULD be used.  If no match is found amongst the rules
   in a group for a matching user-agent, or there are no rules in the
   group, the URI is allowed.  The /robots.txt URI is implicitly
   allowed.

   Octets in the URI and robots.txt paths outside the range of the US-
   ASCII coded character set, and those in the reserved range defined
   by [RFC3986], MUST be percent-encoded as defined by [RFC3986] prior
   to comparison.

   If a percent-encoded US-ASCII octet is encountered in the URI, it
   MUST be unencoded prior to comparison, unless it is a reserved
   character in the URI as defined by [RFC3986] or the character is
   outside the unreserved character range.  The match evaluates
   positively if and only if the end of the path from the rule is
   reached before a difference in octets is encountered.

   For example:

   +===================+======================+======================+
   | Path              | Encoded Path         | Path to Match        |
   +===================+======================+======================+
   | /foo/bar?baz=quz  | /foo/bar?baz=quz     | /foo/bar?baz=quz     |
   +-------------------+----------------------+----------------------+
   | /foo/bar?baz=http | /foo/bar?baz=http%3A | /foo/bar?baz=http%3A |
   | ://foo.bar        | %2F%2Ffoo.bar        | %2F%2Ffoo.bar        |
   +-------------------+----------------------+----------------------+
   | /foo/bar/U+E38384 | /foo/bar/%E3%83%84   | /foo/bar/%E3%83%84   |
   +-------------------+----------------------+----------------------+
   | /foo/             | /foo/bar/%E3%83%84   | /foo/bar/%E3%83%84   |
   | bar/%E3%83%84     |                      |                      |
   +-------------------+----------------------+----------------------+
   | /foo/             | /foo/bar/%62%61%7A   | /foo/bar/baz         |
   | bar/%62%61%7A     |                      |                      |
   +-------------------+----------------------+----------------------+

      Table 2: Examples of matching percent-encoded URI components

   The crawler SHOULD ignore "disallow" and "allow" rules that are not
   in any group (for example, any rule that precedes the first user-
   agent line).

   Implementers MAY bridge encoding mismatches if they detect that the
   robots.txt file is not UTF-8 encoded.

2.2.3.  Special Characters

   Crawlers SHOULD allow the following special characters:

   +===========+===================+==============================+
   | Character | Description       | Example                      |
   +===========+===================+==============================+
   | "#"       | Designates an end | "allow: / # comment in line" |
   |           | of line comment.  |                              |
   |           |                   | "# comment on its own line"  |
   +-----------+-------------------+------------------------------+
   | "$"       | Designates the    | "allow: /this/path/exactly$" |
   |           | end of the match  |                              |
   |           | pattern.          |                              |
   +-----------+-------------------+------------------------------+
   | "*"       | Designates 0 or   | "allow: /this/*/exactly"     |
   |           | more instances of |                              |
   |           | any character.    |                              |
   +-----------+-------------------+------------------------------+

      Table 3: List of special characters in robots.txt files

   If crawlers match special characters verbatim in the URI, crawlers
   SHOULD use "%" encoding.  For example:

   +============================+===============================+
   | Percent-encoded Pattern    | URI                           |
   +============================+===============================+
   | /path/file-with-a-%2A.html | https://www.example.com/path/ |
   |                            | file-with-a-*.html            |
   +----------------------------+-------------------------------+
   | /path/foo-%24              | https://www.example.com/path/ |
   |                            | foo-$                         |
   +----------------------------+-------------------------------+

              Table 4: Example of percent-encoding

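   The matching rules of Sections 2.2.2 and 2.2.3 can be sketched as
   below (non-normative; it assumes the URI path and the patterns have
   already been percent-encoded consistently as described above, and it
   uses the rule pattern's length in octets as the measure of
   specificity):

   import re
   from typing import List, Optional, Tuple

   Rule = Tuple[str, str]  # ("allow" | "disallow", path pattern)

   def pattern_to_regex(pattern: str) -> "re.Pattern[str]":
       # "$" anchors the pattern at the end; "*" matches 0 or more
       # characters; everything else is compared literally.
       anchored = pattern.endswith("$")
       if anchored:
           pattern = pattern[:-1]
       body = ".*".join(re.escape(part) for part in pattern.split("*"))
       return re.compile(body + ("$" if anchored else ""))

   def is_allowed(rules: List[Rule], uri_path: str) -> bool:
       if uri_path == "/robots.txt":
           return True  # the /robots.txt URI is implicitly allowed
       best: Optional[Tuple[int, bool]] = None
       for kind, pattern in rules:
           if pattern and pattern_to_regex(pattern).match(uri_path):
               # Most octets wins; on a tie, allow beats disallow.
               candidate = (len(pattern.encode("utf-8")),
                            kind == "allow")
               if best is None or candidate > best:
                   best = candidate
       return best is None or best[1]  # no match at all means allowed

   With the rules [("allow", "/example/page/"), ("disallow",
   "/example/page/disallowed.gif")], the path
   /example/page/disallowed.gif matches both, and the longer disallow
   rule wins, consistent with the example in Section 5.2.
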
2.2.4.  Other Records

   Clients MAY interpret other records that are not part of the
   robots.txt protocol.  For example, 'sitemap' [SITEMAPS].  Parsing of
   other records MUST NOT interfere with the parsing of explicitly
   defined records in Section 2.

2.3.  Access Method

   The rules MUST be accessible in a file named "/robots.txt" (all
   lower case) in the top level path of the service.  The file MUST be
   UTF-8 encoded (as defined in [RFC3629]) and Internet Media Type
   "text/plain" (as defined in [RFC2046]).

   As per [RFC3986], the URI of the robots.txt is:

   "scheme:[//authority]/robots.txt"

   For example, in the context of HTTP or FTP, the URI is:

   http://www.example.com/robots.txt

   https://www.example.com/robots.txt

   ftp://ftp.example.com/robots.txt

2.3.1.  Access Results

2.3.1.1.  Successful Access

   If the crawler successfully downloads the robots.txt, the crawler
   MUST follow the parseable rules.

2.3.1.2.  Redirects

   The server may respond to a robots.txt fetch request with a
   redirect, such as HTTP 301 or HTTP 302.  Crawlers SHOULD follow at
   least five consecutive redirects, even across authorities (for
   example, hosts in the case of HTTP), as defined in [RFC1945].

   If a robots.txt file is reached within five consecutive redirects,
   the robots.txt file MUST be fetched, parsed, and its rules followed
   in the context of the initial authority.

   If there are more than five consecutive redirects, crawlers MAY
   assume that the robots.txt is unavailable.

2.3.1.3.  Unavailable Status

   Unavailable means the crawler tries to fetch the robots.txt, and the
   server responds with unavailable status codes.  For example, in the
   context of HTTP, unavailable status codes are in the 400-499 range.

   If a server status code indicates that the robots.txt file is
   unavailable to the client, then crawlers MAY access any resources on
   the server.

2.3.1.4.  Unreachable Status

   If the robots.txt is unreachable due to server or network errors,
   this means the robots.txt is undefined and the crawler MUST assume
   complete disallow.  For example, in the context of HTTP, an
   unreachable robots.txt has a response code in the 500-599 range.
   For other undefined status codes, the crawler MUST assume the
   robots.txt is unreachable.

   If the robots.txt is undefined for a reasonably long period of time
   (for example, 30 days), clients MAY assume the robots.txt is
   unavailable or continue to use a cached copy.

2.3.1.5.  Parsing Errors

   Crawlers SHOULD try to parse each line of the robots.txt file.
   Crawlers MUST use the parseable rules.

2.4.  Caching

   Crawlers MAY cache the fetched robots.txt file's contents.  Crawlers
   MAY use standard cache control as defined in [RFC7234].  Crawlers
   SHOULD NOT use the cached version for more than 24 hours, unless the
   robots.txt is unreachable.

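   Taken together, the access results above amount to a three-way
   outcome for the crawler.  The following non-normative Python sketch
   shows one way to implement them for HTTP (the 'requests' dependency,
   the helper name, and the outcome constants are this sketch's
   assumptions; caching per Section 2.4 is omitted):

   from urllib.parse import urljoin

   import requests  # assumed third-party HTTP client

   ALLOW_ALL = "allow-all"        # unavailable: MAY access anything
   DISALLOW_ALL = "disallow-all"  # unreachable: assume complete disallow

   def fetch_robotstxt(authority: str):
       """Return robots.txt contents, or ALLOW_ALL / DISALLOW_ALL."""
       url = "https://" + authority + "/robots.txt"
       try:
           # Follow at most five consecutive redirects by hand.
           for _ in range(6):
               resp = requests.get(url, allow_redirects=False,
                                   timeout=10)
               if (300 <= resp.status_code < 400
                       and "location" in resp.headers):
                   url = urljoin(url, resp.headers["location"])
                   continue
               break
           else:
               return ALLOW_ALL  # more than five redirects: unavailable
           if 200 <= resp.status_code < 300:
               return resp.text  # success: obey the parseable rules
           if 400 <= resp.status_code < 500:
               return ALLOW_ALL  # unavailable status
           return DISALLOW_ALL   # 5xx and the rest: unreachable
       except requests.RequestException:
           return DISALLOW_ALL   # network error: unreachable
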
2.5.  Limits

   Crawlers MAY impose a parsing limit that MUST be at least 500
   kibibytes (KiB).

3.  Security Considerations

   The Robots Exclusion Protocol is not a substitute for valid content
   security measures.  Listing URIs in the robots.txt file exposes
   those URIs publicly and thus makes them discoverable.

4.  IANA Considerations

   This document has no actions for IANA.

5.  Examples

5.1.  Simple Example

   The following example shows:

   *  foobot: A regular case.  A single user-agent token followed by
      rules.

   *  barbot and bazbot: A group that's relevant for more than one
      user-agent.

   *  quxbot: An empty group at the end of the file.

   User-Agent : foobot
   Disallow : /example/page.html
   Disallow : /example/disallowed.gif

   User-Agent : barbot
   User-Agent : bazbot
   Allow : /example/page.html
   Disallow : /example/disallowed.gif

   User-Agent: quxbot

   EOF

5.2.  Longest Match

   The following example shows that in the case of two rules, the
   longest one is used for matching.  In the following case, the rule
   /example/page/disallowed.gif MUST be used for the URI
   example.com/example/page/disallowed.gif.

   User-Agent : foobot
   Allow : /example/page/
   Disallow : /example/page/disallowed.gif

6.  References

6.1.  Normative References

   [RFC1945]  Berners-Lee, T., Fielding, R., and H. Frystyk, "Hypertext
              Transfer Protocol -- HTTP/1.0", RFC 1945,
              DOI 10.17487/RFC1945, May 1996,
              <https://www.rfc-editor.org/info/rfc1945>.

   [RFC2046]  Freed, N. and N. Borenstein, "Multipurpose Internet Mail
              Extensions (MIME) Part Two: Media Types", RFC 2046,
              DOI 10.17487/RFC2046, November 1996,
              <https://www.rfc-editor.org/info/rfc2046>.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC3629]  Yergeau, F., "UTF-8, a transformation format of ISO
              10646", STD 63, RFC 3629, DOI 10.17487/RFC3629, November
              2003, <https://www.rfc-editor.org/info/rfc3629>.

   [RFC3986]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
              Resource Identifier (URI): Generic Syntax", STD 66,
              RFC 3986, DOI 10.17487/RFC3986, January 2005,
              <https://www.rfc-editor.org/info/rfc3986>.

   [RFC5234]  Crocker, D., Ed. and P. Overell, "Augmented BNF for
              Syntax Specifications: ABNF", STD 68, RFC 5234,
              DOI 10.17487/RFC5234, January 2008,
              <https://www.rfc-editor.org/info/rfc5234>.

   [RFC7234]  Fielding, R., Ed., Nottingham, M., Ed., and J. Reschke,
              Ed., "Hypertext Transfer Protocol (HTTP/1.1): Caching",
              RFC 7234, DOI 10.17487/RFC7234, June 2014,
              <https://www.rfc-editor.org/info/rfc7234>.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017, <https://www.rfc-editor.org/info/rfc8174>.

   [RFC8288]  Nottingham, M., "Web Linking", RFC 8288,
              DOI 10.17487/RFC8288, October 2017,
              <https://www.rfc-editor.org/info/rfc8288>.

6.2.  Informative References

   [ROBOTSTXT]
              "Robots Exclusion Protocol", n.d.,
              <http://www.robotstxt.org/>.

   [SITEMAPS] "Sitemaps Protocol", n.d.,
              <https://www.sitemaps.org/>.

Authors' Addresses

   Martijn Koster (editor)
   Stalworthy Computing, Ltd.
   Suton Lane
   Wymondham, Norfolk
   NR18 9JG
   United Kingdom

   Email: m.koster@greenhills.co.uk

   Gary Illyes (editor)
   Google LLC.
   Brandschenkestrasse 110
   CH-8002 Zurich
   Switzerland

   Email: garyillyes@google.com

   Henner Zeller (editor)
   Google LLC.
   1600 Amphitheatre Pkwy
   Mountain View, CA 94043
   United States of America

   Email: henner@google.com

   Lizzi Sassman (editor)
   Google LLC.
   Brandschenkestrasse 110
   CH-8002 Zurich
   Switzerland

   Email: lizzi@google.com