Network Working Group                                          M. Koster
Internet-Draft                                Stalworthy Computing, Ltd.
Intended status: Draft Standard                                G. Illyes
Expires: June 9, 2020                                          H. Zeller
                                                               L. Harvey
                                                                  Google
                                                        January 08, 2020


                       Robots Exclusion Protocol
                          draft-koster-rep-01

Abstract

   This document standardizes and extends the "Robots Exclusion
   Protocol" method originally defined by Martijn Koster in 1996 for
   service owners to control how content served by their services may
   be accessed, if at all, by automatic clients known as crawlers.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."
   This document may not be modified, and derivative works of it may
   not be created, except to format it for publication as an RFC or to
   translate it into languages other than English.

   This Internet-Draft will expire on June 9, 2020.

Copyright Notice

   Copyright (c) 2020 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Terminology
   2.  Specification
     2.1.  Protocol definition
     2.2.  Formal syntax
       2.2.1.  The user-agent line
       2.2.2.  The Allow and Disallow lines
       2.2.3.  Special characters
       2.2.4.  Other records
     2.3.  Access method
       2.3.1.  Access results
     2.4.  Caching
     2.5.  Limits
     2.6.  Security Considerations
     2.7.  IANA Considerations
   3.  Examples
     3.1.  Simple example
     3.2.  Longest Match
   4.  References
     4.1.  Normative References
     4.2.  URIs
   Authors' Addresses

1.  Introduction

   This document applies to services that provide resources that
   clients can access through URIs as defined in RFC3986 [1].  For
   example, in the context of HTTP, a browser is a client that displays
   the content of a web page.

   Crawlers are automated clients.  Search engines, for instance, have
   crawlers that recursively traverse links for indexing, as defined in
   RFC8288 [2].

   It may be inconvenient for service owners if crawlers visit the
   entirety of their URI space.  This document specifies the rules that
   crawlers MUST obey when accessing URIs.

   These rules are not a form of access authorization.

1.1.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.
2.  Specification

2.1.  Protocol definition

   The protocol language consists of rule(s) and group(s):

   o  *Rule*: A line with a key-value pair that defines how a crawler
      may access URIs.  See section "The Allow and Disallow lines".

   o  *Group*: One or more user-agent lines that are followed by one or
      more rules.  A group is terminated by a user-agent line or the
      end of the file.  See section "The user-agent line".  The last
      group may have no rules, which means it implicitly allows
      everything.

2.2.  Formal syntax

   Below is an Augmented Backus-Naur Form (ABNF) description, as
   described in RFC5234 [3].

   robotstxt = *(group / emptyline)
   group = startgroupline                ; We start with a user-agent
           *(startgroupline / emptyline) ; ... and possibly more
                                         ; user-agents
           *(rule / emptyline)           ; followed by rules relevant
                                         ; for UAs

   startgroupline = *WS "user-agent" *WS ":" *WS product-token EOL

   rule = *WS ("allow" / "disallow") *WS ":"
          *WS (path-pattern / empty-pattern) EOL

   ; parser implementors: add additional lines you need (for
   ; example Sitemaps), and be lenient when reading lines that don't
   ; conform.  Apply Postel's law.

   product-token = identifier / "*"
   path-pattern = "/" *(UTF8-char-noctl) ; valid URI path pattern
   empty-pattern = *WS

   identifier = 1*(%x2d / %x41-5a / %x5f / %x61-7a)
   comment = "#" *(UTF8-char-noctl / WS / "#")
   emptyline = EOL
   EOL = *WS [comment] NL ; end-of-line may have
                          ; optional trailing comment
   NL = %x0D / %x0A / %x0D.0A
   WS = %x20 / %x09

   ; UTF8 derived from RFC3629, but excluding control characters

   UTF8-char-noctl = UTF8-1-noctl / UTF8-2 / UTF8-3 / UTF8-4
   UTF8-1-noctl = %x21 / %x22 / %x24-7F ; excluding control, space, '#'
   UTF8-2 = %xC2-DF UTF8-tail
   UTF8-3 = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2( UTF8-tail ) /
            %xED %x80-9F UTF8-tail / %xEE-EF 2( UTF8-tail )
   UTF8-4 = %xF0 %x90-BF 2( UTF8-tail ) / %xF1-F3 3( UTF8-tail ) /
            %xF4 %x80-8F 2( UTF8-tail )

   UTF8-tail = %x80-BF

2.2.1.  The user-agent line

   Crawlers set a product token to find relevant groups.  The product
   token MUST contain only "a-zA-Z_-" characters.  The product token
   SHOULD be part of the identification string that the crawler sends
   to the service (for example, in the case of HTTP, the product name
   SHOULD be in the user-agent header).  The identification string
   SHOULD describe the purpose of the crawler.  Here is an example of
   an HTTP header with a link pointing to a page describing the
   purpose of the ExampleBot crawler; the product token appears both
   in the HTTP header and in the robots.txt user-agent line:

   +----------------------------------------------+-----------------+
   | HTTP header                                  | robots.txt      |
   |                                              | user-agent line |
   +----------------------------------------------+-----------------+
   | user-agent: Mozilla/5.0 (compatible;         | user-agent:     |
   | ExampleBot/0.1;                              | ExampleBot      |
   | https://www.example.com/bot.html)            |                 |
   +----------------------------------------------+-----------------+

   Crawlers MUST find the group that matches the product token exactly,
   and then obey the rules of that group.  If there is more than one
   group matching the user-agent, the matching groups' rules MUST be
   combined into one group.  The matching MUST be case-insensitive.  If
   no matching group exists, crawlers MUST obey the first group with a
   user-agent line with a "*" value, if present.  If no group satisfies
   either condition, or no groups are present at all, no rules apply.
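   The group-selection logic above can be summarized in a short,
   non-normative Python sketch.  The "groups" structure, a list of
   (user-agent tokens, rules) pairs in file order, is assumed to be
   the output of a parser, which is not shown here:

   # Non-normative sketch: pick the rules that apply to a crawler's
   # product token.  "groups" is an assumed parser output: a list of
   # (user_agent_tokens, rules) pairs, in file order.

   def select_rules(groups, product_token):
       token = product_token.lower()

       # Combine the rules of every group whose user-agent line
       # matches the product token exactly (case-insensitively).
       matched = [rule
                  for agents, rules in groups
                  if token in [a.lower() for a in agents]
                  for rule in rules]
       if matched:
           return matched

       # Otherwise, obey the first group with a "*" user-agent line.
       for agents, rules in groups:
           if "*" in agents:
               return rules

       return []  # no matching group: no rules apply

   groups = [
       (["ExampleBot"], [("disallow", "/private/")]),
       (["*"], [("disallow", "/tmp/")]),
       (["examplebot"], [("allow", "/private/press/")]),
   ]
   print(select_rules(groups, "ExampleBot"))
   # -> [('disallow', '/private/'), ('allow', '/private/press/')]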
2.2.2.  The Allow and Disallow lines

   These lines indicate whether accessing a URI that matches the
   corresponding path is allowed or disallowed.

   To evaluate whether access to a URI is allowed, a crawler MUST
   match the paths in allow and disallow rules against the URI.  The
   matching SHOULD be case-sensitive.  The most specific match found
   MUST be used; the most specific match is the match with the most
   octets.  If an allow rule and a disallow rule are equivalent, the
   allow rule SHOULD be used.  If no match is found amongst the rules
   in a group for a matching user-agent, or there are no rules in the
   group, the URI is allowed.  The /robots.txt URI is implicitly
   allowed.

   Octets in the URI and robots.txt paths outside the range of the
   US-ASCII coded character set, and those in the reserved range
   defined by RFC3986 [1], MUST be percent-encoded as defined by
   RFC3986 [1] prior to comparison.

   If a percent-encoded US-ASCII octet is encountered in the URI, it
   MUST be decoded prior to comparison, unless it is a reserved
   character in the URI as defined by RFC3986 [1] or the character is
   outside the unreserved character range.  The match evaluates
   positively if and only if the end of the path from the rule is
   reached before a difference in octets is encountered.

   For example:

   +--------------------+----------------------+----------------------+
   | Path               | Encoded Path         | Path to match        |
   +--------------------+----------------------+----------------------+
   | /foo/bar?baz=quz   | /foo/bar?baz=quz     | /foo/bar?baz=quz     |
   |                    |                      |                      |
   | /foo/bar?baz=      | /foo/bar?baz=        | /foo/bar?baz=        |
   | http://foo.bar     | http%3A%2F%2Ffoo.bar | http%3A%2F%2Ffoo.bar |
   |                    |                      |                      |
   | /foo/bar/U+E38384  | /foo/bar/%E3%83%84   | /foo/bar/%E3%83%84   |
   |                    |                      |                      |
   | /foo/bar/%E3%83%84 | /foo/bar/%E3%83%84   | /foo/bar/%E3%83%84   |
   |                    |                      |                      |
   | /foo/bar/%62%61%7A | /foo/bar/%62%61%7A   | /foo/bar/baz         |
   +--------------------+----------------------+----------------------+

   The crawler SHOULD ignore "disallow" and "allow" rules that are not
   in any group (for example, any rule that precedes the first
   user-agent line).

   Implementers MAY bridge encoding mismatches if they detect that the
   robots.txt file is not UTF-8 encoded.

2.2.3.  Special characters

   Crawlers SHOULD allow the following special characters:

   +-----------+--------------------------------+----------------------+
   | Character | Description                    | Example              |
   +-----------+--------------------------------+----------------------+
   | "#"       | Designates an end-of-line      | "allow: / # comment  |
   |           | comment.                       | in line"             |
   |           |                                |                      |
   |           |                                | "# comment at the    |
   |           |                                | end"                 |
   |           |                                |                      |
   | "$"       | Designates the end of the      | "allow:              |
   |           | match pattern.  The URI MUST   | /this/path/exactly$" |
   |           | end with the matched pattern.  |                      |
   |           |                                |                      |
   | "*"       | Designates 0 or more instances | "allow:              |
   |           | of any character.              | /this/*/exactly"     |
   +-----------+--------------------------------+----------------------+

   To match a special character verbatim in the URI, crawlers SHOULD
   use its "%"-encoded form in the pattern.  For example:

   +------------------------+----------------------------------------+
   | Pattern                | URI                                    |
   +------------------------+----------------------------------------+
   | /path/file-            | https://www.example.com/path/file-     |
   | with-a-%2A.html        | with-a-*.html                          |
   |                        |                                        |
   | /path/foo-%24          | https://www.example.com/path/foo-$     |
   +------------------------+----------------------------------------+
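   A non-normative Python sketch of the matching rules in this section
   and the previous one is shown below.  It assumes the URI path and
   the rule patterns have already been percent-encoded consistently,
   as described above, and uses character counts as a stand-in for
   octet counts:

   # Non-normative sketch of rule matching: "*" matches any sequence
   # of characters, a trailing "$" anchors the pattern at the end of
   # the path, the most specific (longest) matching rule wins, and
   # "allow" is preferred when an allow and a disallow rule tie.
   import re

   def pattern_matches(pattern, path):
       anchored = pattern.endswith("$")
       if anchored:
           pattern = pattern[:-1]
       # Everything except "*" is matched literally; "*" becomes ".*".
       regex = ".*".join(re.escape(p) for p in pattern.split("*"))
       regex += "$" if anchored else ""
       return re.match(regex, path) is not None

   def is_allowed(rules, path):
       """rules: list of ("allow" | "disallow", pattern) pairs."""
       if path == "/robots.txt":        # implicitly allowed
           return True
       best_len, verdict = -1, "allow"  # no match at all: allowed
       for kind, pattern in rules:
           if pattern and pattern_matches(pattern, path):
               longer = len(pattern) > best_len
               tie = len(pattern) == best_len and kind == "allow"
               if longer or tie:
                   best_len, verdict = len(pattern), kind
       return verdict == "allow"

   rules = [("allow", "/example/page/"),
            ("disallow", "/example/page/disallowed.gif")]
   print(is_allowed(rules, "/example/page/index.html"))      # True
   print(is_allowed(rules, "/example/page/disallowed.gif"))  # False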
2.2.4.  Other records

   Clients MAY interpret other records that are not part of the
   robots.txt protocol, for example, 'sitemap' [4].

2.3.  Access method

   The rules MUST be accessible in a file named "/robots.txt" (all
   lower case) in the top level path of the service.  The file MUST be
   UTF-8 encoded (as defined in RFC3629 [5]) and served with the
   Internet Media Type "text/plain" (as defined in RFC2046 [6]).

   As per RFC3986 [1], the URI of the robots.txt is:

   "scheme:[//authority]/robots.txt"

   For example, in the context of HTTP or FTP, the URI is:

   http://www.example.com/robots.txt
   https://www.example.com/robots.txt

   ftp://ftp.example.com/robots.txt

2.3.1.  Access results

2.3.1.1.  Successful access

   If the crawler successfully downloads the robots.txt, the crawler
   MUST follow the parseable rules.

2.3.1.2.  Redirects

   The server may respond to a robots.txt fetch request with a
   redirect, such as HTTP 301 and HTTP 302.  Crawlers SHOULD follow at
   least five consecutive redirects, even across authorities (for
   example, hosts in the case of HTTP), as defined in RFC1945 [7].

   If a robots.txt file is reached within five consecutive redirects,
   the robots.txt file MUST be fetched, parsed, and its rules followed
   in the context of the initial authority.

   If there are more than five consecutive redirects, crawlers MAY
   assume that the robots.txt is unavailable.

2.3.1.3.  Unavailable status

   Unavailable means the crawler tries to fetch the robots.txt, and
   the server responds with unavailable status codes.  For example, in
   the context of HTTP, unavailable status codes are in the 400-499
   range.

   If a server status code indicates that the robots.txt file is
   unavailable to the client, then crawlers MAY access any resources
   on the server or MAY use a cached version of a robots.txt file for
   up to 24 hours.

2.3.1.4.  Unreachable status

   If the robots.txt is unreachable due to server or network errors,
   the robots.txt is undefined and the crawler MUST assume complete
   disallow.  For example, in the context of HTTP, an unreachable
   robots.txt has a response code in the 500-599 range.  For other
   undefined status codes, the crawler MUST assume the robots.txt is
   unreachable.

   If the robots.txt is undefined for a reasonably long period of time
   (for example, 30 days), clients MAY assume the robots.txt is
   unavailable or continue to use a cached copy.

2.3.1.5.  Parsing errors

   Crawlers SHOULD try to parse each line of the robots.txt file.
   Crawlers MUST use the parseable rules.

2.4.  Caching

   Crawlers MAY cache the fetched robots.txt file's contents.
   Crawlers MAY use standard cache control as defined in RFC2616 [8].
   Crawlers SHOULD NOT use the cached version for more than 24 hours,
   unless the robots.txt is unreachable.

2.5.  Limits

   Crawlers MAY impose a parsing limit that MUST be at least 500
   kibibytes (KiB).
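   The access, caching, and limit behavior described in Sections 2.3
   through 2.5 can be sketched, non-normatively, as follows.  Here
   "http_get" is a hypothetical helper, assumed to return a status
   code and body and to follow up to five consecutive redirects on
   its own:

   # Non-normative sketch of fetching robots.txt over HTTP.
   # http_get is a hypothetical helper: http_get(uri) -> (status,
   # body), assumed to follow up to five consecutive redirects.
   import time

   PARSE_LIMIT = 500 * 1024   # parse at least 500 kibibytes
   MAX_CACHE_AGE = 24 * 3600  # SHOULD NOT cache beyond 24 hours

   _cache = {}                # authority -> (fetch_time, body)

   def fetch_robots_txt(authority, http_get):
       """Return robots.txt bytes, b"" for "allow everything"
       (unavailable), or None for "assume complete disallow"
       (unreachable)."""
       now = time.time()
       cached = _cache.get(authority)
       if cached and now - cached[0] < MAX_CACHE_AGE:
           return cached[1]

       status, body = http_get("https://" + authority + "/robots.txt")

       if 200 <= status < 300:
           body = body[:PARSE_LIMIT]   # apply the parsing limit
           _cache[authority] = (now, body)
           return body
       if 400 <= status < 500:
           return b""  # unavailable: any resource may be accessed
       # 5xx and undefined codes: unreachable; a cached copy may
       # still be used, otherwise assume complete disallow.
       return cached[1] if cached else None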
2.6.  Security Considerations

   The Robots Exclusion Protocol MUST NOT be used as a form of
   security measure.  Listing URIs in the robots.txt file exposes them
   publicly and thus makes them discoverable.

2.7.  IANA Considerations

   This document has no actions for IANA.

3.  Examples

3.1.  Simple example

   The following example shows:

   o  *foobot*: A regular case.  A single user-agent token followed by
      rules.

   o  *barbot and bazbot*: A group that is relevant for more than one
      user-agent.

   o  *quxbot*: An empty group at the end of the file.

      User-Agent : foobot
      Disallow : /example/page.html
      Disallow : /example/disallowed.gif

      User-Agent : barbot
      User-Agent : bazbot
      Allow : /example/page.html
      Disallow : /example/disallowed.gif

      User-Agent: quxbot

      EOF

3.2.  Longest Match

   The following example shows that in the case of two matching rules,
   the longest one MUST be used.  In the following case, the rule for
   /example/page/disallowed.gif MUST be used for the URI
   example.com/example/page/disallowed.gif .

      User-Agent : foobot
      Allow : /example/page/
      Disallow : /example/page/disallowed.gif

4.  References

4.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in
              RFC 2119 Key Words", BCP 14, RFC 8174, May 2017.

4.2.  URIs

   [1] https://tools.ietf.org/html/rfc3986

   [2] https://tools.ietf.org/html/rfc8288

   [3] https://tools.ietf.org/html/rfc5234

   [4] https://www.sitemaps.org/index.html

   [5] https://tools.ietf.org/html/rfc3629

   [6] https://tools.ietf.org/html/rfc2046

   [7] https://tools.ietf.org/html/rfc1945

   [8] https://tools.ietf.org/html/rfc2616

Authors' Addresses

   Martijn Koster
   Stalworthy Manor Farm
   Suton Lane, NR18 9JG
   Wymondham, Norfolk
   United Kingdom

   Email: m.koster@greenhills.co.uk

   Gary Illyes
   Brandschenkestrasse 110
   8002, Zurich
   Switzerland

   Email: garyillyes@google.com

   Henner Zeller
   1600 Amphitheatre Pkwy
   Mountain View, CA 94043
   USA

   Email: henner@google.com

   Lizzi Harvey
   1600 Amphitheatre Pkwy
   Mountain View, CA 94043
   USA

   Email: lizzi@google.com