Network Working Group                                     M. Koster, Ed.
Internet-Draft                                Stalworthy Computing, Ltd.
Intended status: Informational                            G. Illyes, Ed.
Expires: 6 November 2022                                  H. Zeller, Ed.
                                                         L. Sassman, Ed.
                                                             Google LLC.
                                                              5 May 2022


                       Robots Exclusion Protocol
                          draft-koster-rep-07

Abstract

   This document specifies and extends the "Robots Exclusion Protocol"
   method originally defined by Martijn Koster in 1996 for service
   owners to control how content served by their services may be
   accessed, if at all, by automatic clients known as crawlers.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 6 November 2022.

Copyright Notice

   Copyright (c) 2022 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents (https://trustee.ietf.org/
   license-info) in effect on the date of publication of this document.
   Please review these documents carefully, as they describe your
   rights and restrictions with respect to this document.  Code
   Components extracted from this document must include Revised BSD
   License text as described in Section 4.e of the Trust Legal
   Provisions and are provided without warranty as described in the
   Revised BSD License.

Table of Contents

   1.  Introduction
     1.1.  Requirements Language
   2.  Specification
     2.1.  Protocol Definition
     2.2.  Formal Syntax
       2.2.1.  The User-Agent Line
       2.2.2.  The Allow and Disallow Lines
       2.2.3.  Special Characters
       2.2.4.  Other Records
     2.3.  Access Method
       2.3.1.  Access Results
         2.3.1.1.  Successful Access
         2.3.1.2.  Redirects
         2.3.1.3.  Unavailable Status
         2.3.1.4.  Unreachable Status
         2.3.1.5.  Parsing Errors
     2.4.  Caching
     2.5.  Limits
   3.  Security Considerations
   4.  IANA Considerations
   5.  Examples
     5.1.  Simple Example
     5.2.  Longest Match
   6.  References
     6.1.  Normative References
     6.2.  Informative References
   Authors' Addresses

1.  Introduction

   This document applies to services that provide resources that
   clients can access through URIs as defined in [RFC3986].  For
   example, in the context of HTTP, a browser is a client that displays
   the content of a web page.

   Crawlers are automated clients.  Search engines, for instance, have
   crawlers that recursively traverse links for indexing, as defined in
   [RFC8288].

   It may be inconvenient for service owners if crawlers visit the
   entirety of their URI space.  This document specifies the rules
   originally defined by the "Robots Exclusion Protocol" [ROBOTSTXT]
   that crawlers are expected to obey when accessing URIs.

   These rules are not a form of access authorization.

1.1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

2.  Specification

2.1.  Protocol Definition

   The protocol language consists of rule(s) and group(s) that the
   service makes available in a file named 'robots.txt' as described in
   Section 2.3, as shown in the example below:

   *  Rule: A line with a key-value pair that defines how a crawler may
      access URIs.  See Section 2.2.2.

   *  Group: One or more user-agent lines that are followed by one or
      more rules.  The group is terminated by a user-agent line or end
      of file.  See Section 2.2.1.  The last group may have no rules,
      which means it implicitly allows everything.

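   For instance, the following illustrative file (the product token
   "foobot" and the paths are placeholders) contains two groups: one
   group of two rules for crawlers identifying as "foobot", and one
   group of a single rule for every other crawler:

   User-Agent: foobot
   Disallow: /example/page.html
   Disallow: /example/disallowed.gif

   User-Agent: *
   Allow: /
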
2.2.  Formal Syntax

   Below is an Augmented Backus-Naur Form (ABNF) description, as
   described in [RFC5234].

    robotstxt = *(group / emptyline)
    group = startgroupline                ; We start with a user-agent
            *(startgroupline / emptyline) ; ... and possibly more
                                          ; user-agents
            *(rule / emptyline)           ; followed by rules relevant
                                          ; for UAs

    startgroupline = *WS "user-agent" *WS ":" *WS product-token EOL

    rule = *WS ("allow" / "disallow") *WS ":"
           *WS (path-pattern / empty-pattern) EOL

    ; parser implementors: add additional lines you need (for
    ; example, sitemaps), and be lenient when reading lines that don't
    ; conform.  Apply Postel's law.

    product-token = identifier / "*"
    path-pattern = "/" *UTF8-char-noctl ; valid URI path pattern
    empty-pattern = *WS

    identifier = 1*(%x2D / %x41-5A / %x5F / %x61-7A)
    comment = "#" *(UTF8-char-noctl / WS / "#")
    emptyline = EOL
    EOL = *WS [comment] NL ; end-of-line may have
                           ; optional trailing comment
    NL = %x0D / %x0A / %x0D.0A
    WS = %x20 / %x09

    ; UTF8 derived from RFC3629, but excluding control characters

    UTF8-char-noctl = UTF8-1-noctl / UTF8-2 / UTF8-3 / UTF8-4
    UTF8-1-noctl = %x21 / %x22 / %x24-7F ; excluding control, space, '#'
    UTF8-2 = %xC2-DF UTF8-tail
    UTF8-3 = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2UTF8-tail /
             %xED %x80-9F UTF8-tail / %xEE-EF 2UTF8-tail
    UTF8-4 = %xF0 %x90-BF 2UTF8-tail / %xF1-F3 3UTF8-tail /
             %xF4 %x80-8F 2UTF8-tail

    UTF8-tail = %x80-BF

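   The grammar above maps naturally onto a small, lenient line parser.
   The following non-normative Python sketch (the function and type
   names are this sketch's own, not defined by the protocol) collects
   user-agent lines and their allow/disallow rules into groups,
   skipping lines that do not conform, per the note on Postel's law:

   import re
   from typing import List, Optional, Tuple

   # A rule is ("allow" | "disallow", path pattern); a group pairs the
   # product tokens of its user-agent lines with its rules.
   Rule = Tuple[str, str]
   Group = Tuple[List[str], List[Rule]]

   LINE = re.compile(r"^\s*(user-agent|allow|disallow)\s*:\s*([^#]*)",
                     re.I)

   def parse_robotstxt(text: str) -> List[Group]:
       groups: List[Group] = []
       current: Optional[Group] = None
       in_user_agents = False
       for line in text.splitlines():
           match = LINE.match(line)
           if match is None:
               continue  # lenient: skip empty or non-conforming lines
           key, value = match.group(1).lower(), match.group(2).strip()
           if key == "user-agent":
               if not in_user_agents:
                   current = ([], [])  # a user-agent line opens a group
                   groups.append(current)
               current[0].append(value.lower())
               in_user_agents = True
           else:
               in_user_agents = False
               if current is not None:  # ignore rules outside any group
                   current[1].append((key, value))
       return groups

   For example, parse_robotstxt("user-agent: foobot\ndisallow:
   /example/\n") yields [(["foobot"], [("disallow", "/example/")])].
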
2.2.1.  The User-Agent Line

   Crawlers set a product token to find relevant groups.  The product
   token MUST contain only "a-zA-Z_-" characters.  The product token
   SHOULD be part of the identification string that the crawler sends
   to the service (for example, in the case of HTTP, the product name
   SHOULD be in the user-agent header).  The identification string
   SHOULD describe the purpose of the crawler.  Here's an example of an
   HTTP header with a link pointing to a page describing the purpose of
   the ExampleBot crawler, which appears both in the HTTP header and as
   a product token:

   +===================================+=================+
   | HTTP header                       | robots.txt      |
   |                                   | user-agent line |
   +===================================+=================+
   | user-agent: Mozilla/5.0           | user-agent:     |
   | (compatible; ExampleBot/0.1;      | ExampleBot      |
   | https://www.example.com/bot.html) |                 |
   +-----------------------------------+-----------------+

        Table 1: Example of a user-agent header and user-
             agent robots.txt token for ExampleBot

   Crawlers MUST find the group that matches the product token exactly,
   and then obey the rules of the group.  If there is more than one
   group matching the user-agent, the matching groups' rules MUST be
   combined into one group.  The matching MUST be case-insensitive.  If
   no matching group exists, crawlers MUST obey the first group with a
   user-agent line with a "*" value, if present.  If no group satisfies
   either condition, or no groups are present at all, no rules apply.

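   This selection logic can be sketched as follows (non-normative; it
   assumes the Group representation of the previous sketch, with
   product tokens stored lowercased):

   from typing import List, Tuple

   Rule = Tuple[str, str]
   Group = Tuple[List[str], List[Rule]]  # (product tokens, rules)

   def select_rules(groups: List[Group],
                    product_token: str) -> List[Rule]:
       token = product_token.lower()  # matching is case-insensitive
       matching = [g for g in groups if token in g[0]]
       if matching:
           # Combine the rules of all exactly matching groups into one.
           return [rule for _, rules in matching for rule in rules]
       for agents, rules in groups:
           if "*" in agents:  # else the first group with "*", if any
               return rules
       return []  # no group applies, so no rules apply
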
2.2.2.  The Allow and Disallow Lines

   These lines indicate whether accessing a URI that matches the
   corresponding path is allowed or disallowed.

   To evaluate if access to a URI is allowed, a robot MUST match the
   paths in allow and disallow rules against the URI.  The matching
   SHOULD be case-sensitive.  The most specific match found MUST be
   used.  The most specific match is the match that has the most
   octets.  If an allow rule and a disallow rule are equivalent, the
   allow rule SHOULD be used.  If no match is found amongst the rules
   in a group for a matching user-agent, or there are no rules in the
   group, the URI is allowed.  The /robots.txt URI is implicitly
   allowed.

   Octets in the URI and robots.txt paths outside the range of the US-
   ASCII coded character set, and those in the reserved range defined
   by [RFC3986], MUST be percent-encoded as defined by [RFC3986] prior
   to comparison.

   If a percent-encoded US-ASCII octet is encountered in the URI, it
   MUST be unencoded prior to comparison, unless it is a reserved
   character in the URI as defined by [RFC3986] or the character is
   outside the unreserved character range.  The match evaluates
   positively if and only if the end of the path from the rule is
   reached before a difference in octets is encountered.

   For example:

   +===================+======================+======================+
   | Path              | Encoded Path         | Path to Match        |
   +===================+======================+======================+
   | /foo/bar?baz=quz  | /foo/bar?baz=quz     | /foo/bar?baz=quz     |
   +-------------------+----------------------+----------------------+
   | /foo/bar?baz=http | /foo/bar?baz=http%3A | /foo/bar?baz=http%3A |
   | ://foo.bar        | %2F%2Ffoo.bar        | %2F%2Ffoo.bar        |
   +-------------------+----------------------+----------------------+
   | /foo/bar/U+E38384 | /foo/bar/%E3%83%84   | /foo/bar/%E3%83%84   |
   +-------------------+----------------------+----------------------+
   | /foo/             | /foo/bar/%E3%83%84   | /foo/bar/%E3%83%84   |
   | bar/%E3%83%84     |                      |                      |
   +-------------------+----------------------+----------------------+
   | /foo/             | /foo/bar/%62%61%7A   | /foo/bar/baz         |
   | bar/%62%61%7A     |                      |                      |
   +-------------------+----------------------+----------------------+

      Table 2: Examples of matching percent-encoded URI components

   The crawler SHOULD ignore "disallow" and "allow" rules that are not
   in any group (for example, any rule that precedes the first user-
   agent line).

   Implementers MAY bridge encoding mismatches if they detect that the
   robots.txt file is not UTF-8 encoded.

2.2.3.  Special Characters

   Crawlers SHOULD allow the following special characters:

   +===========+===================+==============================+
   | Character | Description       | Example                      |
   +===========+===================+==============================+
   | "#"       | Designates an end | "allow: / # comment in line" |
   |           | of line comment.  |                              |
   |           |                   | "# comment on its own line"  |
   +-----------+-------------------+------------------------------+
   | "$"       | Designates the    | "allow: /this/path/exactly$" |
   |           | end of the match  |                              |
   |           | pattern.          |                              |
   +-----------+-------------------+------------------------------+
   | "*"       | Designates 0 or   | "allow: /this/*/exactly"     |
   |           | more instances of |                              |
   |           | any character.    |                              |
   +-----------+-------------------+------------------------------+

      Table 3: List of special characters in robots.txt files

   If crawlers match special characters verbatim in the URI, crawlers
   SHOULD use "%" encoding.  For example:

   +============================+===============================+
   | Percent-encoded Pattern    | URI                           |
   +============================+===============================+
   | /path/file-with-a-%2A.html | https://www.example.com/path/ |
   |                            | file-with-a-*.html            |
   +----------------------------+-------------------------------+
   | /path/foo-%24              | https://www.example.com/path/ |
   |                            | foo-$                         |
   +----------------------------+-------------------------------+

              Table 4: Example of percent-encoding

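   The matching rules of Sections 2.2.2 and 2.2.3 can be sketched as
   below (non-normative; it assumes the URI path and the patterns have
   already been percent-encoded consistently as described above, and it
   uses the rule pattern's length in octets as the measure of
   specificity):

   import re
   from typing import List, Optional, Tuple

   Rule = Tuple[str, str]  # ("allow" | "disallow", path pattern)

   def pattern_to_regex(pattern: str) -> "re.Pattern[str]":
       # "$" anchors the pattern at the end; "*" matches 0 or more
       # characters; everything else is compared literally.
       anchored = pattern.endswith("$")
       if anchored:
           pattern = pattern[:-1]
       body = ".*".join(re.escape(part) for part in pattern.split("*"))
       return re.compile(body + ("$" if anchored else ""))

   def is_allowed(rules: List[Rule], uri_path: str) -> bool:
       if uri_path == "/robots.txt":
           return True  # the /robots.txt URI is implicitly allowed
       best: Optional[Tuple[int, bool]] = None
       for kind, pattern in rules:
           if pattern and pattern_to_regex(pattern).match(uri_path):
               # Most octets wins; on a tie, allow beats disallow.
               candidate = (len(pattern.encode("utf-8")),
                            kind == "allow")
               if best is None or candidate > best:
                   best = candidate
       return best is None or best[1]  # no match at all means allowed

   With the rules [("allow", "/example/page/"), ("disallow",
   "/example/page/disallowed.gif")], the path
   /example/page/disallowed.gif matches both, and the longer disallow
   rule wins, consistent with the example in Section 5.2.
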
2.2.4.  Other Records

   Clients MAY interpret other records that are not part of the
   robots.txt protocol.  For example, 'sitemap' [SITEMAPS].  Parsing of
   other records MUST NOT interfere with the parsing of explicitly
   defined records in Section 2.

2.3.  Access Method

   The rules MUST be accessible in a file named "/robots.txt" (all
   lower case) in the top level path of the service.  The file MUST be
   UTF-8 encoded (as defined in [RFC3629]) and Internet Media Type
   "text/plain" (as defined in [RFC2046]).

   As per [RFC3986], the URI of the robots.txt is:

   "scheme:[//authority]/robots.txt"

   For example, in the context of HTTP or FTP, the URI is:

   http://www.example.com/robots.txt

   https://www.example.com/robots.txt

   ftp://ftp.example.com/robots.txt

2.3.1.  Access Results

2.3.1.1.  Successful Access

   If the crawler successfully downloads the robots.txt, the crawler
   MUST follow the parseable rules.

2.3.1.2.  Redirects

   The server may respond to a robots.txt fetch request with a
   redirect, such as HTTP 301 or HTTP 302.  Crawlers SHOULD follow at
   least five consecutive redirects, even across authorities (for
   example, hosts in the case of HTTP), as defined in [RFC1945].

   If a robots.txt file is reached within five consecutive redirects,
   the robots.txt file MUST be fetched, parsed, and its rules followed
   in the context of the initial authority.

   If there are more than five consecutive redirects, crawlers MAY
   assume that the robots.txt is unavailable.

2.3.1.3.  Unavailable Status

   Unavailable means the crawler tries to fetch the robots.txt, and the
   server responds with unavailable status codes.  For example, in the
   context of HTTP, unavailable status codes are in the 400-499 range.

   If a server status code indicates that the robots.txt file is
   unavailable to the client, then crawlers MAY access any resources on
   the server.

2.3.1.4.  Unreachable Status

   If the robots.txt is unreachable due to server or network errors,
   this means the robots.txt is undefined and the crawler MUST assume
   complete disallow.  For example, in the context of HTTP, an
   unreachable robots.txt has a response code in the 500-599 range.
   For other undefined status codes, the crawler MUST assume the
   robots.txt is unreachable.

   If the robots.txt is undefined for a reasonably long period of time
   (for example, 30 days), clients MAY assume the robots.txt is
   unavailable or continue to use a cached copy.

2.3.1.5.  Parsing Errors

   Crawlers SHOULD try to parse each line of the robots.txt file.
   Crawlers MUST use the parseable rules.

2.4.  Caching

   Crawlers MAY cache the fetched robots.txt file's contents.  Crawlers
   MAY use standard cache control as defined in [RFC7234].  Crawlers
   SHOULD NOT use the cached version for more than 24 hours, unless the
   robots.txt is unreachable.

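   Taken together, the access results above amount to a three-way
   outcome for the crawler.  The following non-normative Python sketch
   shows one way to implement them for HTTP (the 'requests' dependency,
   the helper name, and the outcome constants are this sketch's
   assumptions; caching per Section 2.4 is omitted):

   from urllib.parse import urljoin

   import requests  # assumed third-party HTTP client

   ALLOW_ALL = "allow-all"        # unavailable: MAY access anything
   DISALLOW_ALL = "disallow-all"  # unreachable: assume complete disallow

   def fetch_robotstxt(authority: str):
       """Return robots.txt contents, or ALLOW_ALL / DISALLOW_ALL."""
       url = "https://" + authority + "/robots.txt"
       try:
           # Follow at most five consecutive redirects by hand.
           for _ in range(6):
               resp = requests.get(url, allow_redirects=False,
                                   timeout=10)
               if (300 <= resp.status_code < 400
                       and "location" in resp.headers):
                   url = urljoin(url, resp.headers["location"])
                   continue
               break
           else:
               return ALLOW_ALL  # more than five redirects: unavailable
           if 200 <= resp.status_code < 300:
               return resp.text  # success: obey the parseable rules
           if 400 <= resp.status_code < 500:
               return ALLOW_ALL  # unavailable status
           return DISALLOW_ALL   # 5xx and the rest: unreachable
       except requests.RequestException:
           return DISALLOW_ALL   # network error: unreachable
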
2.5.  Limits

   Crawlers MAY impose a parsing limit that MUST be at least 500
   kibibytes (KiB).

3.  Security Considerations

   The Robots Exclusion Protocol is not a substitute for valid content
   security measures.  Listing URIs in the robots.txt file exposes
   those URIs publicly and thus makes them discoverable.

4.  IANA Considerations

   This document has no actions for IANA.

5.  Examples

5.1.  Simple Example

   The following example shows:

   *  foobot: A regular case.  A single user-agent token followed by
      rules.

   *  barbot and bazbot: A group that's relevant for more than one
      user-agent.

   *  quxbot: An empty group at the end of the file.

   User-Agent : foobot
   Disallow : /example/page.html
   Disallow : /example/disallowed.gif

   User-Agent : barbot
   User-Agent : bazbot
   Allow : /example/page.html
   Disallow : /example/disallowed.gif

   User-Agent: quxbot

   EOF

5.2.  Longest Match

   The following example shows that in the case of two rules, the
   longest one is used for matching.  In the following case, the rule
   /example/page/disallowed.gif MUST be used for the URI
   example.com/example/page/disallowed.gif.

   User-Agent : foobot
   Allow : /example/page/
   Disallow : /example/page/disallowed.gif

6.  References

6.1.  Normative References

   [RFC1945]  Berners-Lee, T., Fielding, R., and H. Frystyk, "Hypertext
              Transfer Protocol -- HTTP/1.0", RFC 1945,
              DOI 10.17487/RFC1945, May 1996,
              <https://www.rfc-editor.org/info/rfc1945>.

   [RFC2046]  Freed, N. and N. Borenstein, "Multipurpose Internet Mail
              Extensions (MIME) Part Two: Media Types", RFC 2046,
              DOI 10.17487/RFC2046, November 1996,
              <https://www.rfc-editor.org/info/rfc2046>.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC3629]  Yergeau, F., "UTF-8, a transformation format of ISO
              10646", STD 63, RFC 3629, DOI 10.17487/RFC3629, November
              2003, <https://www.rfc-editor.org/info/rfc3629>.

   [RFC3986]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
              Resource Identifier (URI): Generic Syntax", STD 66,
              RFC 3986, DOI 10.17487/RFC3986, January 2005,
              <https://www.rfc-editor.org/info/rfc3986>.

   [RFC5234]  Crocker, D., Ed. and P. Overell, "Augmented BNF for
              Syntax Specifications: ABNF", STD 68, RFC 5234,
              DOI 10.17487/RFC5234, January 2008,
              <https://www.rfc-editor.org/info/rfc5234>.

   [RFC7234]  Fielding, R., Ed., Nottingham, M., Ed., and J. Reschke,
              Ed., "Hypertext Transfer Protocol (HTTP/1.1): Caching",
              RFC 7234, DOI 10.17487/RFC7234, June 2014,
              <https://www.rfc-editor.org/info/rfc7234>.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017, <https://www.rfc-editor.org/info/rfc8174>.

   [RFC8288]  Nottingham, M., "Web Linking", RFC 8288,
              DOI 10.17487/RFC8288, October 2017,
              <https://www.rfc-editor.org/info/rfc8288>.

6.2.  Informative References

   [ROBOTSTXT]
              "Robots Exclusion Protocol", n.d.,
              <http://www.robotstxt.org/>.

   [SITEMAPS] "Sitemaps Protocol", n.d.,
              <https://www.sitemaps.org/>.

Authors' Addresses

   Martijn Koster (editor)
   Stalworthy Computing, Ltd.
   Suton Lane
   Wymondham, Norfolk
   NR18 9JG
   United Kingdom

   Email: m.koster@greenhills.co.uk

   Gary Illyes (editor)
   Google LLC.
   Brandschenkestrasse 110
   CH-8002 Zurich
   Switzerland

   Email: garyillyes@google.com

   Henner Zeller (editor)
   Google LLC.
   1600 Amphitheatre Pkwy
   Mountain View, CA 94043
   United States of America

   Email: henner@google.com

   Lizzi Sassman (editor)
   Google LLC.
   Brandschenkestrasse 110
   CH-8002 Zurich
   Switzerland

   Email: lizzi@google.com