Network Working Group                                          M. Koster
Internet-Draft                                Stalworthy Computing, Ltd.
Intended status: Informational                                 G. Illyes
Expires: December 2, 2021                                      H. Zeller
                                                               L. Harvey
                                                                  Google
                                                           June 05, 2021


                       Robots Exclusion Protocol
                          draft-koster-rep-05

Abstract

   This document specifies and extends the "Robots Exclusion Protocol"
   method originally defined by Martijn Koster in 1996 for service
   owners to control how content served by their services may be
   accessed, if at all, by automatic clients known as crawlers.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."
   This document may not be modified, and derivative works of it may
   not be created, except to format it for publication as an RFC or to
   translate it into languages other than English.

   This Internet-Draft will expire on December 2, 2021.

Copyright Notice

   Copyright (c) 2021 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Terminology
   2.  Specification
     2.1.  Protocol definition
     2.2.  Formal syntax
       2.2.1.  The user-agent line
       2.2.2.  The Allow and Disallow lines
       2.2.3.  Special characters
       2.2.4.  Other records
     2.3.  Access method
       2.3.1.  Access results
     2.4.  Caching
     2.5.  Limits
     2.6.  Security Considerations
     2.7.  IANA Considerations
   3.  Examples
     3.1.  Simple example
     3.2.  Longest Match
   4.  References
     4.1.  Normative References
     4.2.  URIs
   Authors' Addresses

1.  Introduction

   This document applies to services that provide resources that
   clients can access through URIs as defined in RFC3986 [2].  For
   example, in the context of HTTP, a browser is a client that displays
   the content of a web page.

   Crawlers are automated clients.  Search engines, for instance, have
   crawlers that recursively traverse links for indexing, as defined in
   RFC8288 [3].

   It may be inconvenient for service owners if crawlers visit the
   entirety of their URI space.  This document specifies the rules that
   crawlers are expected to obey when accessing URIs.

   These rules are not a form of access authorization.

1.1.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

2.  Specification
2.1.  Protocol definition

   The protocol language consists of rule(s) and group(s) that the
   service makes available in a file named "robots.txt" as described in
   section 2.3:

   o  *Rule*: A line with a key-value pair that defines how a crawler
      may access URIs.  See section 2.2.2.

   o  *Group*: One or more user-agent lines followed by one or more
      rules.  A group is terminated by a user-agent line or by the end
      of the file.  See section 2.2.1.  The last group may have no
      rules, which means it implicitly allows everything.

2.2.  Formal syntax

   Below is an Augmented Backus-Naur Form (ABNF) description, as
   defined in RFC5234 [4].

     robotstxt = *(group / emptyline)
     group = startgroupline                ; We start with a user-agent
             *(startgroupline / emptyline) ; ... and possibly more
                                           ; user-agents
             *(rule / emptyline)           ; followed by rules relevant
                                           ; for UAs

     startgroupline = *WS "user-agent" *WS ":" *WS product-token EOL

     rule = *WS ("allow" / "disallow") *WS ":"
            *WS (path-pattern / empty-pattern) EOL

     ; parser implementors: add additional lines you need (for
     ; example Sitemaps), and be lenient when reading lines that don't
     ; conform.  Apply Postel's law.

     product-token = identifier / "*"
     path-pattern = "/" *UTF8-char-noctl ; valid URI path pattern
     empty-pattern = *WS

     identifier = 1*(%x2D / %x41-5A / %x5F / %x61-7A)
     comment = "#" *(UTF8-char-noctl / WS / "#")
     emptyline = EOL
     EOL = *WS [comment] NL ; end-of-line may have
                            ; optional trailing comment
     NL = %x0D / %x0A / %x0D.0A
     WS = %x20 / %x09

     ; UTF8 derived from RFC3629, but excluding control characters

     UTF8-char-noctl = UTF8-1-noctl / UTF8-2 / UTF8-3 / UTF8-4
     UTF8-1-noctl = %x21 / %x22 / %x24-7F ; excluding control, space, '#'
     UTF8-2 = %xC2-DF UTF8-tail
     UTF8-3 = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2UTF8-tail /
              %xED %x80-9F UTF8-tail / %xEE-EF 2UTF8-tail
     UTF8-4 = %xF0 %x90-BF 2UTF8-tail / %xF1-F3 3UTF8-tail /
              %xF4 %x80-8F 2UTF8-tail

     UTF8-tail = %x80-BF

2.2.1.  The user-agent line

   Crawlers set a product token to find relevant groups.  The product
   token MUST contain only "a-zA-Z_-" characters.  The product token
   SHOULD be part of the identification string that the crawler sends
   to the service (for example, in the case of HTTP, the product token
   SHOULD appear in the user-agent header).  The identification string
   SHOULD describe the purpose of the crawler.  Here's an example of an
   HTTP header with a link pointing to a page describing the purpose of
   the ExampleBot crawler, which appears both in the HTTP header and as
   a product token:

   +-------------------------------------------------+-----------------+
   | HTTP header                                     | robots.txt      |
   |                                                 | user-agent line |
   +-------------------------------------------------+-----------------+
   | user-agent: Mozilla/5.0 (compatible;            | user-agent:     |
   | ExampleBot/0.1;                                 | ExampleBot      |
   | https://www.example.com/bot.html)               |                 |
   +-------------------------------------------------+-----------------+

   Crawlers MUST find the group that matches the product token exactly,
   and then obey the rules of the group.  If there is more than one
   group matching the user-agent, the matching groups' rules MUST be
   combined into one group.  The matching MUST be case-insensitive.  If
   no matching group exists, crawlers MUST obey the first group with a
   user-agent line with a "*" value, if present.  If no group satisfies
   either condition, or no groups are present at all, no rules apply.
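   The group-selection logic above can be illustrated with a short,
   non-normative sketch.  Python is used for illustration only, and the
   data layout (a list of (user-agent-tokens, rules) pairs in file
   order, produced by some prior parsing step) is an assumption of this
   sketch, not part of the protocol:

     def select_rules(groups, product_token):
         """Return the combined list of rules the crawler must obey.

         groups: list of (user_agent_tokens, rules) pairs, in file
         order, as produced by a hypothetical robots.txt parser."""
         token = product_token.lower()   # matching is case-insensitive
         exact = []
         wildcard = None
         for user_agents, rules in groups:
             agents = [ua.lower() for ua in user_agents]
             if token in agents:
                 exact.extend(rules)     # combine all matching groups
             elif "*" in agents and wildcard is None:
                 wildcard = rules        # first "*" group is the fallback
         if exact:
             return exact
         if wildcard is not None:
             return wildcard
         return []                       # no rules apply

     groups = [
         (["foobot"], [("disallow", "/example/")]),
         (["barbot", "bazbot"], [("allow", "/example/page.html")]),
         (["*"], [("disallow", "/private/")]),
     ]
     assert select_rules(groups, "FooBot") == [("disallow", "/example/")]
     assert select_rules(groups, "quxbot") == [("disallow", "/private/")]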
2.2.2.  The Allow and Disallow lines

   These lines indicate whether accessing a URI that matches the
   corresponding path is allowed or disallowed.

   To evaluate whether access to a URI is allowed, a crawler MUST match
   the paths in allow and disallow rules against the URI.  The matching
   SHOULD be case-sensitive.  The most specific match found MUST be
   used.  The most specific match is the match that has the most
   octets.  If an allow rule and a disallow rule are equivalent, the
   allow rule SHOULD be used.  If no match is found amongst the rules
   in a group for a matching user-agent, or there are no rules in the
   group, the URI is allowed.  The /robots.txt URI is implicitly
   allowed.

   Octets in the URI and robots.txt paths outside the range of the US-
   ASCII coded character set, and those in the reserved range defined
   by RFC3986 [2], MUST be percent-encoded as defined by RFC3986 [2]
   prior to comparison.

   If a percent-encoded US-ASCII octet is encountered in the URI, it
   MUST be unencoded prior to comparison, unless it is a reserved
   character in the URI as defined by RFC3986 [2] or the character is
   outside the unreserved character range.  The match evaluates
   positively if and only if the end of the path from the rule is
   reached before a difference in octets is encountered.

   For example:

   +--------------------+-----------------------+-----------------------+
   | Path               | Encoded Path          | Path to match         |
   +--------------------+-----------------------+-----------------------+
   | /foo/bar?baz=quz   | /foo/bar?baz=quz      | /foo/bar?baz=quz      |
   |                    |                       |                       |
   | /foo/bar?baz=http  | /foo/bar?baz=http%3A% | /foo/bar?baz=http%3A% |
   | ://foo.bar         | 2F%2Ffoo.bar          | 2F%2Ffoo.bar          |
   |                    |                       |                       |
   | /foo/bar/U+E38384  | /foo/bar/%E3%83%84    | /foo/bar/%E3%83%84    |
   |                    |                       |                       |
   | /foo/bar/%E3%83%84 | /foo/bar/%E3%83%84    | /foo/bar/%E3%83%84    |
   |                    |                       |                       |
   | /foo/bar/%62%61%7A | /foo/bar/%62%61%7A    | /foo/bar/baz          |
   +--------------------+-----------------------+-----------------------+

   The crawler SHOULD ignore "disallow" and "allow" rules that are not
   in any group (for example, any rule that precedes the first user-
   agent line).

   Implementers MAY bridge encoding mismatches if they detect that the
   robots.txt file is not UTF-8 encoded.

2.2.3.  Special characters

   Crawlers SHOULD allow the following special characters:

   +-----------+--------------------------------+----------------------+
   | Character | Description                    | Example              |
   +-----------+--------------------------------+----------------------+
   | "#"       | Designates an end-of-line      | "allow: / # comment  |
   |           | comment.                       | in line"             |
   |           |                                |                      |
   |           |                                | "# comment at the    |
   |           |                                | end"                 |
   |           |                                |                      |
   | "$"       | Designates the end of the      | "allow:              |
   |           | match pattern.  The URI MUST   | /this/path/exactly$" |
   |           | end there for the rule to      |                      |
   |           | match.                         |                      |
   |           |                                |                      |
   | "*"       | Designates 0 or more instances | "allow:              |
   |           | of any character.              | /this/*/exactly"     |
   +-----------+--------------------------------+----------------------+
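   The rule evaluation described in sections 2.2.2 and 2.2.3 can be
   summarized with another non-normative sketch.  It assumes the URI
   path has already been percent-encoding-normalized as described in
   section 2.2.2, and the function names are this sketch's own:

     import re

     def pattern_to_regex(pattern):
         # A trailing "$" anchors the pattern to the end of the URI;
         # "*" matches zero or more instances of any character.
         anchored = pattern.endswith("$")
         if anchored:
             pattern = pattern[:-1]
         parts = [re.escape(p) for p in pattern.split("*")]
         return re.compile("^" + ".*".join(parts) +
                           ("$" if anchored else ""))

     def is_allowed(rules, path):
         """rules: list of ("allow" | "disallow", pattern) pairs for
         the matching group; path: the normalized URI path."""
         if path == "/robots.txt":
             return True                 # implicitly allowed
         best_len = -1
         verdict = True                  # no match at all: allowed
         for rule_type, pattern in rules:
             if pattern_to_regex(pattern).match(path):
                 # The most specific (longest) match wins; an allow
                 # rule wins over an equivalent disallow rule.
                 if len(pattern) > best_len or (
                         len(pattern) == best_len and
                         rule_type == "allow"):
                     best_len = len(pattern)
                     verdict = rule_type == "allow"
         return verdict

     rules = [("allow", "/example/page/"),
              ("disallow", "/example/page/disallowed.gif")]
     assert is_allowed(rules, "/example/page/index.html")
     assert not is_allowed(rules, "/example/page/disallowed.gif")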
   If crawlers match special characters verbatim in the URI, crawlers
   SHOULD use "%" encoding.  For example:

   +------------------------+------------------------------------------+
   | Pattern                | URI                                      |
   +------------------------+------------------------------------------+
   | /path/file-            | https://www.example.com/path/file-       |
   | with-a-%2A.html        | with-a-*.html                            |
   |                        |                                          |
   | /path/foo-%24          | https://www.example.com/path/foo-$       |
   +------------------------+------------------------------------------+

2.2.4.  Other records

   Clients MAY interpret other records that are not part of the
   robots.txt protocol; for example, 'sitemap' [5].

2.3.  Access method

   The rules MUST be accessible in a file named "/robots.txt" (all
   lower case) in the top level path of the service.  The file MUST be
   UTF-8 encoded (as defined in RFC3629 [6]) and Internet Media Type
   "text/plain" (as defined in RFC2046 [7]).

   As per RFC3986 [2], the URI of the robots.txt is:

     "scheme:[//authority]/robots.txt"

   For example, in the context of HTTP or FTP, the URI is:

     http://www.example.com/robots.txt
     https://www.example.com/robots.txt

     ftp://ftp.example.com/robots.txt

2.3.1.  Access results

2.3.1.1.  Successful access

   If the crawler successfully downloads the robots.txt, the crawler
   MUST follow the parseable rules.

2.3.1.2.  Redirects

   The server may respond to a robots.txt fetch request with a
   redirect, such as HTTP 301 and HTTP 302.  Crawlers SHOULD follow at
   least five consecutive redirects, even across authorities (for
   example, hosts in the case of HTTP), as defined in RFC1945 [8].

   If a robots.txt file is reached within five consecutive redirects,
   the robots.txt file MUST be fetched, parsed, and its rules followed
   in the context of the initial authority.

   If there are more than five consecutive redirects, crawlers MAY
   assume that the robots.txt is unavailable.

2.3.1.3.  Unavailable status

   "Unavailable" means the crawler tried to fetch the robots.txt and
   the server responded with an unavailable status code.  For example,
   in the context of HTTP, unavailable status codes are in the 400-499
   range.

   If a server status code indicates that the robots.txt file is
   unavailable to the client, then crawlers MAY access any resources
   on the server or MAY use a cached version of a robots.txt file for
   up to 24 hours.

2.3.1.4.  Unreachable status

   If the robots.txt is unreachable due to server or network errors,
   this means the robots.txt is undefined and the crawler MUST assume
   complete disallow.  For example, in the context of HTTP, an
   unreachable robots.txt has a response code in the 500-599 range.
   For other undefined status codes, the crawler MUST assume the
   robots.txt is unreachable.

   If the robots.txt is undefined for a reasonably long period of time
   (for example, 30 days), clients MAY assume the robots.txt is
   unavailable or continue to use a cached copy.

2.3.1.5.  Parsing errors

   Crawlers SHOULD try to parse each line of the robots.txt file.
   Crawlers MUST use the parseable rules.

2.4.  Caching

   Crawlers MAY cache the fetched robots.txt file's contents.
   Crawlers MAY use standard cache control as defined in RFC2616 [9].
   Crawlers SHOULD NOT use the cached version for more than 24 hours,
   unless the robots.txt is unreachable.

2.5.  Limits

   Crawlers MAY impose a parsing limit that MUST be at least 500
   kibibytes (KiB).
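   The access results in section 2.3.1, together with the limits in
   sections 2.4 and 2.5, suggest fetch logic along the lines of the
   following non-normative sketch.  The helper http_get and the
   "allow-all" / "disallow-all" / "parse" return values are
   conventions of this sketch only:

     MAX_REDIRECTS = 5             # section 2.3.1.2
     MAX_PARSE_BYTES = 500 * 1024  # parse at least 500 KiB (2.5)

     def fetch_robots_txt(url, http_get):
         """http_get(url) -> (status, location, body) is a stand-in
         for a real HTTP client."""
         redirects = 0
         while True:
             status, location, body = http_get(url)
             if status in (301, 302) and location:
                 redirects += 1
                 if redirects > MAX_REDIRECTS:
                     return "allow-all"  # MAY treat as unavailable
                 url = location          # follow, even across
                 continue                # authorities
             if 200 <= status <= 299:
                 return ("parse", body[:MAX_PARSE_BYTES])
             if 400 <= status <= 499:
                 return "allow-all"      # unavailable (2.3.1.3)
             return "disallow-all"       # unreachable or undefined
                                         # status (2.3.1.4)

     def fake_http_get(url):
         # Stand-in client that always returns a parseable file.
         return 200, None, "user-agent: *\ndisallow: /private/\n"

     print(fetch_robots_txt("https://www.example.com/robots.txt",
                            fake_http_get))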
2.6.  Security Considerations

   The Robots Exclusion Protocol is not a substitute for valid content
   security measures.  Listing URIs in the robots.txt file exposes
   them publicly and thus makes them discoverable.

2.7.  IANA Considerations

   This document has no actions for IANA.

3.  Examples

3.1.  Simple example

   The following example shows:

   o  *foobot*: A regular case.  A single user-agent token followed by
      rules.

   o  *barbot and bazbot*: A group that's relevant for more than one
      user-agent.

   o  *quxbot*: An empty group at the end of the file.

     User-Agent : foobot
     Disallow : /example/page.html
     Disallow : /example/disallowed.gif

     User-Agent : barbot
     User-Agent : bazbot
     Allow : /example/page.html
     Disallow : /example/disallowed.gif

     User-Agent: quxbot

     EOF

3.2.  Longest Match

   The following example shows that in the case of two matching rules,
   the longest one is used.  In the following case, the rule with the
   path /example/page/disallowed.gif MUST be used for the URI
   example.com/example/page/disallowed.gif, because it is the most
   specific match.

     User-Agent : foobot
     Allow : /example/page/
     Disallow : /example/page/disallowed.gif

4.  References

4.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in
              RFC 2119 Key Words", BCP 14, RFC 8174, May 2017.

4.2.  URIs

   [1] http://www.robotstxt.org/

   [2] https://tools.ietf.org/html/rfc3986

   [3] https://tools.ietf.org/html/rfc8288

   [4] https://tools.ietf.org/html/rfc5234

   [5] https://www.sitemaps.org/index.html

   [6] https://tools.ietf.org/html/rfc3629

   [7] https://tools.ietf.org/html/rfc2046

   [8] https://tools.ietf.org/html/rfc1945

   [9] https://tools.ietf.org/html/rfc2616

Authors' Addresses

   Martijn Koster
   Stalworthy Manor Farm
   Suton Lane, NR18 9JG
   Wymondham, Norfolk
   United Kingdom

   Email: m.koster@greenhills.co.uk

   Gary Illyes
   Brandschenkestrasse 110
   8002, Zurich
   Switzerland

   Email: garyillyes@google.com

   Henner Zeller
   1600 Amphitheatre Pkwy
   Mountain View, CA 94043
   USA

   Email: henner@google.com

   Lizzi Harvey
   1600 Amphitheatre Pkwy
   Mountain View, CA 94043
   USA

   Email: lizzi@google.com