Network Working Group                                          M. Koster
Internet-Draft                                Stalworthy Computing, Ltd.
Intended status: Draft Standard                                G. Illyes
Expires: June 9, 2020                                          H. Zeller
                                                               L. Harvey
                                                                  Google
                                                        January 08, 2020


                       Robots Exclusion Protocol
                          draft-koster-rep-01

Abstract

   This document standardizes and extends the "Robots Exclusion
   Protocol" method originally defined by Martijn Koster in 1996 for
   service owners to control how content served by their services may
   be accessed, if at all, by automatic clients known as crawlers.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."
   This document may not be modified, and derivative works of it may
   not be created, except to format it for publication as an RFC or to
   translate it into languages other than English.

   This Internet-Draft will expire on June 9, 2020.

Copyright Notice

   Copyright (c) 2020 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Terminology
   2.  Specification
     2.1.  Protocol definition
     2.2.  Formal syntax
       2.2.1.  The user-agent line
       2.2.2.  The Allow and Disallow lines
       2.2.3.  Special characters
       2.2.4.  Other records
     2.3.  Access method
       2.3.1.  Access results
     2.4.  Caching
     2.5.  Limits
     2.6.  Security Considerations
     2.7.  IANA Considerations
   3.  Examples
     3.1.  Simple example
     3.2.  Longest Match
   4.  References
     4.1.  Normative References
     4.2.  URIs
   Authors' Addresses

1.  Introduction

   This document applies to services that provide resources that
   clients can access through URIs as defined in RFC3986 [1].  For
   example, in the context of HTTP, a browser is a client that displays
   the content of a web page.

   Crawlers are automated clients.  Search engines, for instance, have
   crawlers that recursively traverse links for indexing, as defined in
   RFC8288 [2].

   It may be inconvenient for service owners if crawlers visit the
   entirety of their URI space.  This document specifies the rules that
   crawlers MUST obey when accessing URIs.

   These rules are not a form of access authorization.

1.1.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.
2.  Specification

2.1.  Protocol definition

   The protocol language consists of rule(s) and group(s):

   o  *Rule*: A line with a key-value pair that defines how a crawler
      may access URIs.  See section "The Allow and Disallow lines".

   o  *Group*: One or more user-agent lines that are followed by one or
      more rules.  A group is terminated by a user-agent line or the
      end of the file.  See section "The user-agent line".  The last
      group may have no rules, which means it implicitly allows
      everything.

2.2.  Formal syntax

   Below is an Augmented Backus-Naur Form (ABNF) description, as
   described in RFC5234 [3].

   robotstxt = *(group / emptyline)
   group = startgroupline                ; We start with a user-agent
           *(startgroupline / emptyline) ; ... and possibly more
                                         ; user-agents
           *(rule / emptyline)           ; followed by rules relevant
                                         ; for UAs

   startgroupline = *WS "user-agent" *WS ":" *WS product-token EOL

   rule = *WS ("allow" / "disallow") *WS ":"
          *WS (path-pattern / empty-pattern) EOL

   ; parser implementors: add additional lines you need (for
   ; example Sitemaps), and be lenient when reading lines that don't
   ; conform.  Apply Postel's law.

   product-token = identifier / "*"
   path-pattern = "/" *(UTF8-char-noctl) ; valid URI path pattern
   empty-pattern = *WS

   identifier = 1*(%x2d / %x41-5a / %x5f / %x61-7a)
   comment = "#" *(UTF8-char-noctl / WS / "#")
   emptyline = EOL
   EOL = *WS [comment] NL ; end-of-line may have
                          ; optional trailing comment
   NL = %x0D / %x0A / %x0D.0A
   WS = %x20 / %x09

   ; UTF8 derived from RFC3629, but excluding control characters

   UTF8-char-noctl = UTF8-1-noctl / UTF8-2 / UTF8-3 / UTF8-4
   UTF8-1-noctl = %x21 / %x22 / %x24-7F ; excluding control, space, '#'
   UTF8-2 = %xC2-DF UTF8-tail
   UTF8-3 = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2( UTF8-tail ) /
            %xED %x80-9F UTF8-tail / %xEE-EF 2( UTF8-tail )
   UTF8-4 = %xF0 %x90-BF 2( UTF8-tail ) / %xF1-F3 3( UTF8-tail ) /
            %xF4 %x80-8F 2( UTF8-tail )

   UTF8-tail = %x80-BF

2.2.1.  The user-agent line

   Crawlers set a product token to find relevant groups.  The product
   token MUST contain only "a-zA-Z_-" characters.  The product token
   SHOULD be part of the identification string that the crawler sends
   to the service (for example, in the case of HTTP, the product name
   SHOULD be in the user-agent header).  The identification string
   SHOULD describe the purpose of the crawler.  Here is an example of
   an HTTP header with a link pointing to a page describing the
   purpose of the ExampleBot crawler; the product token appears both
   in the HTTP header and in the robots.txt user-agent line:

   +----------------------------------------------+-----------------+
   | HTTP header                                  | robots.txt      |
   |                                              | user-agent line |
   +----------------------------------------------+-----------------+
   | user-agent: Mozilla/5.0 (compatible;         | user-agent:     |
   | ExampleBot/0.1;                              | ExampleBot      |
   | https://www.example.com/bot.html)            |                 |
   +----------------------------------------------+-----------------+

   Crawlers MUST find the group that matches the product token exactly,
   and then obey the rules of that group.  If there is more than one
   group matching the user-agent, the matching groups' rules MUST be
   combined into one group.  The matching MUST be case-insensitive.  If
   no matching group exists, crawlers MUST obey the first group with a
   user-agent line with a "*" value, if present.  If no group satisfies
   either condition, or no groups are present at all, no rules apply.
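   The group-selection logic above can be summarized in a short,
   non-normative Python sketch.  The "groups" structure, a list of
   (user-agent tokens, rules) pairs in file order, is assumed to be
   the output of a parser, which is not shown here:

   # Non-normative sketch: pick the rules that apply to a crawler's
   # product token.  "groups" is an assumed parser output: a list of
   # (user_agent_tokens, rules) pairs, in file order.

   def select_rules(groups, product_token):
       token = product_token.lower()

       # Combine the rules of every group whose user-agent line
       # matches the product token exactly (case-insensitively).
       matched = [rule
                  for agents, rules in groups
                  if token in [a.lower() for a in agents]
                  for rule in rules]
       if matched:
           return matched

       # Otherwise, obey the first group with a "*" user-agent line.
       for agents, rules in groups:
           if "*" in agents:
               return rules

       return []  # no matching group: no rules apply

   groups = [
       (["ExampleBot"], [("disallow", "/private/")]),
       (["*"], [("disallow", "/tmp/")]),
       (["examplebot"], [("allow", "/private/press/")]),
   ]
   print(select_rules(groups, "ExampleBot"))
   # -> [('disallow', '/private/'), ('allow', '/private/press/')]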
2.2.2.  The Allow and Disallow lines

   These lines indicate whether accessing a URI that matches the
   corresponding path is allowed or disallowed.

   To evaluate whether access to a URI is allowed, a crawler MUST
   match the paths in allow and disallow rules against the URI.  The
   matching SHOULD be case-sensitive.  The most specific match found
   MUST be used; the most specific match is the match with the most
   octets.  If an allow rule and a disallow rule are equivalent, the
   allow rule SHOULD be used.  If no match is found amongst the rules
   in a group for a matching user-agent, or there are no rules in the
   group, the URI is allowed.  The /robots.txt URI is implicitly
   allowed.

   Octets in the URI and robots.txt paths outside the range of the
   US-ASCII coded character set, and those in the reserved range
   defined by RFC3986 [1], MUST be percent-encoded as defined by
   RFC3986 [1] prior to comparison.

   If a percent-encoded US-ASCII octet is encountered in the URI, it
   MUST be decoded prior to comparison, unless it is a reserved
   character in the URI as defined by RFC3986 [1] or the character is
   outside the unreserved character range.  The match evaluates
   positively if and only if the end of the path from the rule is
   reached before a difference in octets is encountered.

   For example:

   +--------------------+----------------------+----------------------+
   | Path               | Encoded Path         | Path to match        |
   +--------------------+----------------------+----------------------+
   | /foo/bar?baz=quz   | /foo/bar?baz=quz     | /foo/bar?baz=quz     |
   |                    |                      |                      |
   | /foo/bar?baz=      | /foo/bar?baz=        | /foo/bar?baz=        |
   | http://foo.bar     | http%3A%2F%2Ffoo.bar | http%3A%2F%2Ffoo.bar |
   |                    |                      |                      |
   | /foo/bar/U+E38384  | /foo/bar/%E3%83%84   | /foo/bar/%E3%83%84   |
   |                    |                      |                      |
   | /foo/bar/%E3%83%84 | /foo/bar/%E3%83%84   | /foo/bar/%E3%83%84   |
   |                    |                      |                      |
   | /foo/bar/%62%61%7A | /foo/bar/%62%61%7A   | /foo/bar/baz         |
   +--------------------+----------------------+----------------------+

   The crawler SHOULD ignore "disallow" and "allow" rules that are not
   in any group (for example, any rule that precedes the first
   user-agent line).

   Implementers MAY bridge encoding mismatches if they detect that the
   robots.txt file is not UTF-8 encoded.

2.2.3.  Special characters

   Crawlers SHOULD allow the following special characters:

   +-----------+--------------------------------+----------------------+
   | Character | Description                    | Example              |
   +-----------+--------------------------------+----------------------+
   | "#"       | Designates an end-of-line      | "allow: / # comment  |
   |           | comment.                       | in line"             |
   |           |                                |                      |
   |           |                                | "# comment at the    |
   |           |                                | end"                 |
   |           |                                |                      |
   | "$"       | Designates the end of the      | "allow:              |
   |           | match pattern.  The URI MUST   | /this/path/exactly$" |
   |           | end with the matched pattern.  |                      |
   |           |                                |                      |
   | "*"       | Designates 0 or more instances | "allow:              |
   |           | of any character.              | /this/*/exactly"     |
   +-----------+--------------------------------+----------------------+

   To match a special character verbatim in the URI, crawlers SHOULD
   use its "%"-encoded form in the pattern.  For example:

   +------------------------+----------------------------------------+
   | Pattern                | URI                                    |
   +------------------------+----------------------------------------+
   | /path/file-            | https://www.example.com/path/file-     |
   | with-a-%2A.html        | with-a-*.html                          |
   |                        |                                        |
   | /path/foo-%24          | https://www.example.com/path/foo-$     |
   +------------------------+----------------------------------------+
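   A non-normative Python sketch of the matching rules in this section
   and the previous one is shown below.  It assumes the URI path and
   the rule patterns have already been percent-encoded consistently,
   as described above, and uses character counts as a stand-in for
   octet counts:

   # Non-normative sketch of rule matching: "*" matches any sequence
   # of characters, a trailing "$" anchors the pattern at the end of
   # the path, the most specific (longest) matching rule wins, and
   # "allow" is preferred when an allow and a disallow rule tie.
   import re

   def pattern_matches(pattern, path):
       anchored = pattern.endswith("$")
       if anchored:
           pattern = pattern[:-1]
       # Everything except "*" is matched literally; "*" becomes ".*".
       regex = ".*".join(re.escape(p) for p in pattern.split("*"))
       regex += "$" if anchored else ""
       return re.match(regex, path) is not None

   def is_allowed(rules, path):
       """rules: list of ("allow" | "disallow", pattern) pairs."""
       if path == "/robots.txt":        # implicitly allowed
           return True
       best_len, verdict = -1, "allow"  # no match at all: allowed
       for kind, pattern in rules:
           if pattern and pattern_matches(pattern, path):
               longer = len(pattern) > best_len
               tie = len(pattern) == best_len and kind == "allow"
               if longer or tie:
                   best_len, verdict = len(pattern), kind
       return verdict == "allow"

   rules = [("allow", "/example/page/"),
            ("disallow", "/example/page/disallowed.gif")]
   print(is_allowed(rules, "/example/page/index.html"))      # True
   print(is_allowed(rules, "/example/page/disallowed.gif"))  # False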
2.2.4.  Other records

   Clients MAY interpret other records that are not part of the
   robots.txt protocol, for example, 'sitemap' [4].

2.3.  Access method

   The rules MUST be accessible in a file named "/robots.txt" (all
   lower case) in the top level path of the service.  The file MUST be
   UTF-8 encoded (as defined in RFC3629 [5]) and served with the
   Internet Media Type "text/plain" (as defined in RFC2046 [6]).

   As per RFC3986 [1], the URI of the robots.txt is:

   "scheme:[//authority]/robots.txt"

   For example, in the context of HTTP or FTP, the URI is:

   http://www.example.com/robots.txt
   https://www.example.com/robots.txt

   ftp://ftp.example.com/robots.txt

2.3.1.  Access results

2.3.1.1.  Successful access

   If the crawler successfully downloads the robots.txt, the crawler
   MUST follow the parseable rules.

2.3.1.2.  Redirects

   The server may respond to a robots.txt fetch request with a
   redirect, such as HTTP 301 and HTTP 302.  Crawlers SHOULD follow at
   least five consecutive redirects, even across authorities (for
   example, hosts in the case of HTTP), as defined in RFC1945 [7].

   If a robots.txt file is reached within five consecutive redirects,
   the robots.txt file MUST be fetched, parsed, and its rules followed
   in the context of the initial authority.

   If there are more than five consecutive redirects, crawlers MAY
   assume that the robots.txt is unavailable.

2.3.1.3.  Unavailable status

   Unavailable means the crawler tries to fetch the robots.txt, and
   the server responds with unavailable status codes.  For example, in
   the context of HTTP, unavailable status codes are in the 400-499
   range.

   If a server status code indicates that the robots.txt file is
   unavailable to the client, then crawlers MAY access any resources
   on the server or MAY use a cached version of a robots.txt file for
   up to 24 hours.

2.3.1.4.  Unreachable status

   If the robots.txt is unreachable due to server or network errors,
   the robots.txt is undefined and the crawler MUST assume complete
   disallow.  For example, in the context of HTTP, an unreachable
   robots.txt has a response code in the 500-599 range.  For other
   undefined status codes, the crawler MUST assume the robots.txt is
   unreachable.

   If the robots.txt is undefined for a reasonably long period of time
   (for example, 30 days), clients MAY assume the robots.txt is
   unavailable or continue to use a cached copy.

2.3.1.5.  Parsing errors

   Crawlers SHOULD try to parse each line of the robots.txt file.
   Crawlers MUST use the parseable rules.

2.4.  Caching

   Crawlers MAY cache the fetched robots.txt file's contents.
   Crawlers MAY use standard cache control as defined in RFC2616 [8].
   Crawlers SHOULD NOT use the cached version for more than 24 hours,
   unless the robots.txt is unreachable.

2.5.  Limits

   Crawlers MAY impose a parsing limit that MUST be at least 500
   kibibytes (KiB).
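   The access, caching, and limit behavior described in Sections 2.3
   through 2.5 can be sketched, non-normatively, as follows.  Here
   "http_get" is a hypothetical helper, assumed to return a status
   code and body and to follow up to five consecutive redirects on
   its own:

   # Non-normative sketch of fetching robots.txt over HTTP.
   # http_get is a hypothetical helper: http_get(uri) -> (status,
   # body), assumed to follow up to five consecutive redirects.
   import time

   PARSE_LIMIT = 500 * 1024   # parse at least 500 kibibytes
   MAX_CACHE_AGE = 24 * 3600  # SHOULD NOT cache beyond 24 hours

   _cache = {}                # authority -> (fetch_time, body)

   def fetch_robots_txt(authority, http_get):
       """Return robots.txt bytes, b"" for "allow everything"
       (unavailable), or None for "assume complete disallow"
       (unreachable)."""
       now = time.time()
       cached = _cache.get(authority)
       if cached and now - cached[0] < MAX_CACHE_AGE:
           return cached[1]

       status, body = http_get("https://" + authority + "/robots.txt")

       if 200 <= status < 300:
           body = body[:PARSE_LIMIT]   # apply the parsing limit
           _cache[authority] = (now, body)
           return body
       if 400 <= status < 500:
           return b""  # unavailable: any resource may be accessed
       # 5xx and undefined codes: unreachable; a cached copy may
       # still be used, otherwise assume complete disallow.
       return cached[1] if cached else None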
2.6.  Security Considerations

   The Robots Exclusion Protocol MUST NOT be used as a form of
   security measure.  Listing URIs in the robots.txt file exposes them
   publicly and thus makes them discoverable.

2.7.  IANA Considerations

   This document has no actions for IANA.

3.  Examples

3.1.  Simple example

   The following example shows:

   o  *foobot*: A regular case.  A single user-agent token followed by
      rules.

   o  *barbot and bazbot*: A group that is relevant for more than one
      user-agent.

   o  *quxbot*: An empty group at the end of the file.

      User-Agent : foobot
      Disallow : /example/page.html
      Disallow : /example/disallowed.gif

      User-Agent : barbot
      User-Agent : bazbot
      Allow : /example/page.html
      Disallow : /example/disallowed.gif

      User-Agent: quxbot

      EOF

3.2.  Longest Match

   The following example shows that in the case of two matching rules,
   the longest one MUST be used.  In the following case, the rule for
   /example/page/disallowed.gif MUST be used for the URI
   example.com/example/page/disallowed.gif .

      User-Agent : foobot
      Allow : /example/page/
      Disallow : /example/page/disallowed.gif

4.  References

4.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in
              RFC 2119 Key Words", BCP 14, RFC 8174, May 2017.

4.2.  URIs

   [1] https://tools.ietf.org/html/rfc3986

   [2] https://tools.ietf.org/html/rfc8288

   [3] https://tools.ietf.org/html/rfc5234

   [4] https://www.sitemaps.org/index.html

   [5] https://tools.ietf.org/html/rfc3629

   [6] https://tools.ietf.org/html/rfc2046

   [7] https://tools.ietf.org/html/rfc1945

   [8] https://tools.ietf.org/html/rfc2616

Authors' Addresses

   Martijn Koster
   Stalworthy Manor Farm
   Suton Lane, NR18 9JG
   Wymondham, Norfolk
   United Kingdom

   Email: m.koster@greenhills.co.uk

   Gary Illyes
   Brandschenkestrasse 110
   8002, Zurich
   Switzerland

   Email: garyillyes@google.com

   Henner Zeller
   1600 Amphitheatre Pkwy
   Mountain View, CA 94043
   USA

   Email: henner@google.com

   Lizzi Harvey
   1600 Amphitheatre Pkwy
   Mountain View, CA 94043
   USA

   Email: lizzi@google.com