idnits 2.17.1

draft-koster-rep-04.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  -- The document has an IETF Trust Provisions (28 Dec 2009) Section 6.c(i)
     Publication Limitation clause.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (December 08, 2020) is 1234 days in the past.  Is this
     intentional?

  -- Found something which looks like a code comment -- if you have code
     sections in the document, please surround them with '<CODE BEGINS>' and
     '<CODE ENDS>' lines.

  Checking references for intended status: Informational
  ----------------------------------------------------------------------------

  -- Looks like a reference, but probably isn't: '1' on line 424

  -- Looks like a reference, but probably isn't: '2' on line 426

  -- Looks like a reference, but probably isn't: '3' on line 428

  -- Looks like a reference, but probably isn't: '4' on line 430

  -- Looks like a reference, but probably isn't: '5' on line 432

  -- Looks like a reference, but probably isn't: '6' on line 434

  -- Looks like a reference, but probably isn't: '7' on line 436

  -- Looks like a reference, but probably isn't: '8' on line 438

  -- Duplicate reference: RFC2119, mentioned in 'RFC8174', was also mentioned
     in 'RFC2119'.

     Summary: 0 errors (**), 0 flaws (~~), 1 warning (==), 12 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.
--------------------------------------------------------------------------------

1     Network Working Group                                          M. Koster
2     Internet-Draft                                Stalworthy Computing, Ltd.
3     Intended status: Informational                                 G. Illyes
4     Expires: June 5, 2021                                          H. Zeller
5                                                                    L. Harvey
6                                                                       Google
7                                                            December 08, 2020

9                            Robots Exclusion Protocol
10                              draft-koster-rep-04

12    Abstract

14       This document standardizes and extends the "Robots Exclusion
15       Protocol" method originally defined by
16       Martijn Koster in 1996 for service owners to control how content
17       served by their services may be accessed, if at all, by automatic
18       clients known as crawlers.

20    Status of This Memo

22       This Internet-Draft is submitted in full conformance with the
23       provisions of BCP 78 and BCP 79.

25       Internet-Drafts are working documents of the Internet Engineering
26       Task Force (IETF).  Note that other groups may also distribute
27       working documents as Internet-Drafts.  The list of current Internet-
28       Drafts is at http://datatracker.ietf.org/drafts/current/.

30       Internet-Drafts are draft documents valid for a maximum of six months
31       and may be updated, replaced, or obsoleted by other documents at any
32       time.  It is inappropriate to use Internet-Drafts as reference
33       material or to cite them other than as "work in progress."

35       This document may not be modified, and derivative works of it may not
36       be created, except to format it for publication as an RFC or to
37       translate it into languages other than English.

39       This Internet-Draft will expire on June 5, 2021.

41    Copyright Notice

43       Copyright (c) 2020 IETF Trust and the persons identified as the
44       document authors.  All rights reserved.

46       This document is subject to BCP 78 and the IETF Trust's Legal
47       Provisions Relating to IETF Documents
48       (http://trustee.ietf.org/license-info) in effect on the date of
49       publication of this document.  Please review these documents
50       carefully, as they describe your rights and restrictions with respect
51       to this document.
Code Components extracted from this document must
52       include Simplified BSD License text as described in Section 4.e of
53       the Trust Legal Provisions and are provided without warranty as
54       described in the Simplified BSD License.

56    Table of Contents

58       1.  Introduction
59         1.1.  Terminology
60       2.  Specification
61         2.1.  Protocol definition
62         2.2.  Formal syntax
63           2.2.1.  The user-agent line
64           2.2.2.  The Allow and Disallow lines
65           2.2.3.  Special characters
66           2.2.4.  Other records
67         2.3.  Access method
68           2.3.1.  Access results
69         2.4.  Caching
70         2.5.  Limits
71         2.6.  Security Considerations
72         2.7.  IANA Considerations
73       3.  Examples
74         3.1.  Simple example
75         3.2.  Longest Match
76       4.  References
77         4.1.  Normative References
78         4.2.  URIs
79       Authors' Addresses

81    1.  Introduction

83       This document applies to services that provide resources that clients
84       can access through URIs as defined in RFC3986 [1].  For example, in
85       the context of HTTP, a browser is a client that displays the content
86       of a web page.
88       Crawlers are automated clients.  Search engines, for instance, have
89       crawlers to recursively traverse links for indexing as defined in
90       RFC8288 [2].

92       It may be inconvenient for service owners if crawlers visit the
93       entirety of their URI space.  This document specifies the rules that
94       crawlers MUST obey when accessing URIs.

96       These rules are not a form of access authorization.

98    1.1.  Terminology

100      The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
101      "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
102      "OPTIONAL" in this document are to be interpreted as described in
103      BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
104      capitals, as shown here.

106   2.  Specification

108   2.1.  Protocol definition

110      The protocol language consists of rule(s) and group(s):

112      o  *Rule*: A line with a key-value pair that defines how a crawler
113         may access URIs.  See Section 2.2.2.

115      o  *Group*: One or more user-agent lines that are followed by one or
116         more rules.  The group is terminated by a user-agent line or end
117         of file.  See Section 2.2.1.  The last group may have no rules,
118         which means it implicitly allows everything.

120   2.2.  Formal syntax

122      Below is an Augmented Backus-Naur Form (ABNF) description, as
123      described in RFC5234 [3].

125    robotstxt = *(group / emptyline)
126    group = startgroupline                ; We start with a user-agent
127            *(startgroupline / emptyline) ; ... and possibly more
128                                          ; user-agents
129            *(rule / emptyline)           ; followed by rules relevant
130                                          ; for UAs

132    startgroupline = *WS "user-agent" *WS ":" *WS product-token EOL

134    rule = *WS ("allow" / "disallow") *WS ":"
135          *WS (path-pattern / empty-pattern) EOL

137    ; parser implementors: add additional lines you need (for
138    ; example Sitemaps), and be lenient when reading lines that don't
139    ; conform. Apply Postel's law.
141    product-token = identifier / "*"
142    path-pattern = "/" *UTF8-char-noctl ; valid URI path pattern
143    empty-pattern = *WS

145    identifier = 1*(%x2D / %x41-5A / %x5F / %x61-7A)
146    comment = "#" *(UTF8-char-noctl / WS / "#")
147    emptyline = EOL
148    EOL = *WS [comment] NL ; end-of-line may have
149                           ; optional trailing comment
150    NL = %x0D / %x0A / %x0D.0A
151    WS = %x20 / %x09
152    ; UTF8 derived from RFC3629, but excluding control characters

154    UTF8-char-noctl = UTF8-1-noctl / UTF8-2 / UTF8-3 / UTF8-4
155    UTF8-1-noctl = %x21 / %x22 / %x24-7F ; excluding control, space, '#'
156    UTF8-2 = %xC2-DF UTF8-tail
157    UTF8-3 = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2UTF8-tail /
158             %xED %x80-9F UTF8-tail / %xEE-EF 2UTF8-tail
159    UTF8-4 = %xF0 %x90-BF 2UTF8-tail / %xF1-F3 3UTF8-tail /
160             %xF4 %x80-8F 2UTF8-tail

162    UTF8-tail = %x80-BF

164   2.2.1.  The user-agent line

166      Crawlers set a product token to find relevant groups.  The product
167      token MUST contain only "a-zA-Z_-" characters.  The product token
168      SHOULD be part of the identification string that the crawler sends
169      to the service (for example, in the case of HTTP, the product name
170      SHOULD be in the user-agent header).  The identification string
171      SHOULD describe the purpose of the crawler.  Here's an example of an
172      HTTP header with a link pointing to a page describing the purpose of
173      the ExampleBot crawler which appears both in the HTTP header and as a
174      product token:

176      +-------------------------------------------------+-----------------+
177      | HTTP header                                     | robots.txt      |
178      |                                                 | user-agent line |
179      +-------------------------------------------------+-----------------+
180      | user-agent: Mozilla/5.0 (compatible;            | user-agent:     |
181      | ExampleBot/0.1;                                 | ExampleBot      |
182      | https://www.example.com/bot.html)               |                 |
183      +-------------------------------------------------+-----------------+

185      Crawlers MUST find the group that matches the product token exactly,
186      and then obey the rules of the group.
If there is more than one
187      group matching the user-agent, the matching groups' rules MUST be
188      combined into one group.  The matching MUST be case-insensitive.  If
189      no matching group exists, crawlers MUST obey the first group with a
190      user-agent line with a "*" value, if present.  If no group satisfies
191      either condition, or no groups are present at all, no rules apply.

193   2.2.2.  The Allow and Disallow lines

195      These lines indicate whether accessing a URI that matches the
196      corresponding path is allowed or disallowed.

198      To evaluate if access to a URI is allowed, a crawler MUST match the
199      paths in allow and disallow rules against the URI.  The matching
200      SHOULD be case sensitive.  The most specific match found MUST be
201      used.  The most specific match is the match that has the most octets.
202      If an allow rule and a disallow rule are equivalent, the allow rule
203      SHOULD be used.  If no match is found amongst the rules in a group
204      for a matching user-agent, or there are no rules in the group, the
205      URI is allowed.  The /robots.txt URI is implicitly allowed.

207      Octets in the URI and robots.txt paths outside the range of the US-
208      ASCII coded character set, and those in the reserved range defined by
209      RFC3986 [1], MUST be percent-encoded as defined by RFC3986 [1] prior
210      to comparison.

212      If a percent-encoded US-ASCII octet is encountered in the URI, it
213      MUST be unencoded prior to comparison, unless it is a reserved
214      character in the URI as defined by RFC3986 [1] or the character is
215      outside the unreserved character range.  The match evaluates
216      positively if and only if the end of the path from the rule is
217      reached before a difference in octets is encountered.
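The comparison rules of Section 2.2.2 can be sketched in Python as
follows.  This is an illustrative sketch, not part of the protocol:
the names normalize_path and is_allowed are hypothetical, the special
characters of Section 2.2.3 are not handled, an empty pattern is
assumed to match nothing, and the sketch does not attempt to
percent-encode reserved characters, since it cannot tell data from
delimiters.

```python
import re

# Unreserved characters as listed in RFC 3986, Section 2.3.
UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)


def normalize_path(path):
    """Percent-encode non-ASCII octets; decode %XX only if unreserved."""
    encoded = "".join(
        "%{:02X}".format(octet) if octet >= 0x80 else chr(octet)
        for octet in path.encode("utf-8")
    )

    def decode_unreserved(match):
        char = chr(int(match.group(1), 16))
        return char if char in UNRESERVED else "%" + match.group(1).upper()

    return re.sub(r"%([0-9A-Fa-f]{2})", decode_unreserved, encoded)


def is_allowed(rules, uri_path):
    """rules: iterable of ("allow" | "disallow", path) for one group."""
    target = normalize_path(uri_path)
    if target == "/robots.txt":
        return True  # the /robots.txt URI is implicitly allowed
    best_len, allowed = -1, True  # no match found means allowed
    for kind, rule_path in rules:
        prefix = normalize_path(rule_path)
        # Assumption of this sketch: an empty pattern matches nothing.
        if not prefix or not target.startswith(prefix):
            continue
        length = len(prefix)  # normalized form is ASCII: chars == octets
        is_allow = kind.lower() == "allow"
        # The match with the most octets wins; a tie goes to allow.
        if length > best_len or (length == best_len and is_allow):
            best_len, allowed = length, is_allow
    return allowed
```

Under these assumptions, a group holding an allow rule for
/example/page/ and a disallow rule for /example/page/disallowed.gif
disallows only the longer path, and normalize_path maps
/foo/bar/%62%61%7A to /foo/bar/baz before comparison.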
219      For example:

221      +-------------------+-----------------------+-----------------------+
222      | Path              | Encoded Path          | Path to match         |
223      +-------------------+-----------------------+-----------------------+
224      | /foo/bar?baz=quz  | /foo/bar?baz=quz      | /foo/bar?baz=quz      |
225      |                   |                       |                       |
226      | /foo/bar?baz=http | /foo/bar?baz=http%3A% | /foo/bar?baz=http%3A% |
227      | ://foo.bar        | 2F%2Ffoo.bar          | 2F%2Ffoo.bar          |
228      |                   |                       |                       |
229      | /foo/bar/U+E38384 | /foo/bar/%E3%83%84    | /foo/bar/%E3%83%84    |
230      |                   |                       |                       |
231      | /foo/bar/%E3%83%8 | /foo/bar/%E3%83%84    | /foo/bar/%E3%83%84    |
232      | 4                 |                       |                       |
233      |                   |                       |                       |
234      | /foo/bar/%62%61%7 | /foo/bar/%62%61%7A    | /foo/bar/baz          |
235      | A                 |                       |                       |
236      +-------------------+-----------------------+-----------------------+

238      The crawler SHOULD ignore "disallow" and "allow" rules that are not
239      in any group (for example, any rule that precedes the first user-
240      agent line).

242      Implementers MAY bridge encoding mismatches if they detect that the
243      robots.txt file is not UTF-8 encoded.

245   2.2.3.  Special characters

247      Crawlers SHOULD allow the following special characters:

249      +-----------+--------------------------------+----------------------+
250      | Character | Description                    | Example              |
251      +-----------+--------------------------------+----------------------+
252      | "#"       | Designates an end of line      | "allow: / # comment  |
253      |           | comment.                       | in line"             |
254      |           |                                |                      |
255      |           |                                | "# comment at the    |
256      |           |                                | end"                 |
257      |           |                                |                      |
258      | "$"       | Designates the end of the      | "allow:              |
259      |           | match pattern; the URI MUST    | /this/path/exactly$" |
260      |           | end there.                     |                      |
261      |           |                                |                      |
262      | "*"       | Designates 0 or more instances | "allow:              |
263      |           | of any character.              | /this/*/exactly"     |
264      +-----------+--------------------------------+----------------------+

266      If crawlers match special characters verbatim in the URI, crawlers
267      SHOULD use "%" encoding.
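One way to evaluate the "*" and "$" special characters of Section
2.2.3 is to translate a rule pattern into a regular expression.  The
Python sketch below illustrates that approach under stated
assumptions: pattern_matches is a hypothetical name, regular
expressions are only one possible technique, and both the pattern and
the path are assumed to be already normalized, with any literal "*" or
"$" left percent-encoded as "%2A" or "%24".

```python
import re


def pattern_matches(pattern, path):
    """Return True if the rule pattern matches the start of path."""
    # A trailing "$" anchors the pattern to the end of the path.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal text; each "*" becomes "0 or more of any character".
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    regex = re.compile(body + (r"\Z" if anchored else ""), re.DOTALL)
    # re.match only matches at the beginning of the string.
    return regex.match(path) is not None
```

With this sketch, "/this/path/exactly$" matches only the path
/this/path/exactly, while "/this/*/exactly" also matches longer paths
such as /this/a/b/exactly.html.  A pattern containing "%2A" or "%24"
is compared literally and never acts as a wildcard or an anchor.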
For example:

269      +------------------------+------------------------------------------+
270      | Pattern                | URI                                      |
271      +------------------------+------------------------------------------+
272      | /path/file-            | https://www.example.com/path/file-       |
273      | with-a-%2A.html        | with-a-*.html                            |
274      |                        |                                          |
275      | /path/foo-%24          | https://www.example.com/path/foo-$       |
276      +------------------------+------------------------------------------+

278   2.2.4.  Other records

280      Clients MAY interpret other records that are not part of the
281      robots.txt protocol.  For example, 'sitemap' [4].

283   2.3.  Access method

285      The rules MUST be accessible in a file named "/robots.txt" (all lower
286      case) in the top level path of the service.  The file MUST be UTF-8
287      encoded (as defined in RFC3629 [5]) and Internet Media Type "text/
288      plain" (as defined in RFC2046 [6]).

290      As per RFC3986 [1], the URI of the robots.txt is:

292      "scheme:[//authority]/robots.txt"

294      For example, in the context of HTTP or FTP, the URI is:

296      http://www.example.com/robots.txt
297      https://www.example.com/robots.txt

299      ftp://ftp.example.com/robots.txt

301   2.3.1.  Access results

303   2.3.1.1.  Successful access

305      If the crawler successfully downloads the robots.txt, the crawler
306      MUST follow the parseable rules.

308   2.3.1.2.  Redirects

310      The server may respond to a robots.txt fetch request with a redirect,
311      such as HTTP 301 and HTTP 302.  The crawlers SHOULD follow at least
312      five consecutive redirects, even across authorities (for example
313      hosts in case of HTTP), as defined in RFC1945 [7].

315      If a robots.txt file is reached within five consecutive redirects,
316      the robots.txt file MUST be fetched, parsed, and its rules followed
317      in the context of the initial authority.

319      If there are more than five consecutive redirects, crawlers MAY
320      assume that the robots.txt is unavailable.

322   2.3.1.3.
Unavailable status

324      Unavailable means the crawler tries to fetch the robots.txt, and the
325      server responds with unavailable status codes.  For example, in the
326      context of HTTP, unavailable status codes are in the 400-499 range.

328      If a server status code indicates that the robots.txt file is
329      unavailable to the client, then crawlers MAY access any resources on
330      the server or MAY use a cached version of a robots.txt file for up to
331      24 hours.

333   2.3.1.4.  Unreachable status

335      If the robots.txt is unreachable due to server or network errors,
336      this means the robots.txt is undefined and the crawler MUST assume
337      complete disallow.  For example, in the context of HTTP, an
338      unreachable robots.txt has a response code in the 500-599 range.  For
339      other undefined status codes, the crawler MUST assume the robots.txt
340      is unreachable.

342      If the robots.txt is undefined for a reasonably long period of time
343      (for example, 30 days), clients MAY assume the robots.txt is
344      unavailable or continue to use a cached copy.

346   2.3.1.5.  Parsing errors

348      Crawlers SHOULD try to parse each line of the robots.txt file.
349      Crawlers MUST use the parseable rules.

351   2.4.  Caching

353      Crawlers MAY cache the fetched robots.txt file's contents.  Crawlers
354      MAY use standard cache control as defined in RFC2616 [8].  Crawlers
355      SHOULD NOT use the cached version for more than 24 hours, unless the
356      robots.txt is unreachable.

358   2.5.  Limits

360      Crawlers MAY impose a parsing limit that MUST be at least 500
361      kibibytes (KiB).

363   2.6.  Security Considerations

365      The Robots Exclusion Protocol MUST NOT be used as a form of security
366      measure.  Listing URIs in the robots.txt file exposes the URIs
367      publicly and thus makes them discoverable.

369   2.7.  IANA Considerations

371      This document has no actions for IANA.

373   3.  Examples

375   3.1.  Simple example

377      The following example shows:

379      o  *foobot*: A regular case.
A single user-agent token followed by
380         rules.
381      o  *barbot and bazbot*: A group that's relevant for more than one
382         user-agent.
383      o  *quxbot*: Empty group at end of file.

385      <CODE BEGINS>
386      User-Agent : foobot
387      Disallow : /example/page.html
388      Disallow : /example/disallowed.gif

390      User-Agent : barbot
391      User-Agent : bazbot
392      Allow : /example/page.html
393      Disallow : /example/disallowed.gif

395      User-Agent: quxbot

397      EOF
398      <CODE ENDS>

400   3.2.  Longest Match

402      The following example shows that in the case of two rules, the
403      longest one MUST be used for matching.  In the following case,
404      /example/page/disallowed.gif MUST be used for the URI
405      example.com/example/page/disallowed.gif.

407      <CODE BEGINS>
408      User-Agent : foobot
409      Allow : /example/page/
410      Disallow : /example/page/disallowed.gif
411      <CODE ENDS>

413   4.  References

415   4.1.  Normative References

417      [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
418                 Requirement Levels", BCP 14, RFC 2119, March 1997.
419      [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in
420                 RFC 2119 Key Words", BCP 14, RFC 8174, May 2017.

422   4.2.  URIs

424      [1] https://tools.ietf.org/html/rfc3986

426      [2] https://tools.ietf.org/html/rfc8288

428      [3] https://tools.ietf.org/html/rfc5234

430      [4] https://www.sitemaps.org/index.html

432      [5] https://tools.ietf.org/html/rfc3629

434      [6] https://tools.ietf.org/html/rfc2046

436      [7] https://tools.ietf.org/html/rfc1945

438      [8] https://tools.ietf.org/html/rfc2616

440   Authors' Addresses

442      Martijn Koster
443      Stalworthy Manor Farm
444      Suton Lane, NR18 9JG
445      Wymondham, Norfolk
446      United Kingdom
447      Email: m.koster@greenhills.co.uk

449      Gary Illyes
450      Brandschenkestrasse 110
451      8002, Zurich
452      Switzerland
453      Email: garyillyes@google.com

455      Henner Zeller
456      1600 Amphitheatre Pkwy
457      Mountain View, CA 94043
458      USA
459      Email: henner@google.com

461      Lizzi Harvey
462      1600 Amphitheatre Pkwy
463      Mountain View, CA 94043
464      USA
465      Email: lizzi@google.com