Network Working Group                                          M. Koster
Internet-Draft                                Stalworthy Computing, Ltd.
Intended status: Informational                                 G. Illyes
Expires: December 2, 2021                                      H. Zeller
                                                               L. Harvey
                                                                  Google
                                                           June 05, 2021


                       Robots Exclusion Protocol
                          draft-koster-rep-05

Abstract

   This document specifies and extends the "Robots Exclusion Protocol"
   method originally defined by Martijn Koster in 1996 for service
   owners to control how content served by their services may be
   accessed, if at all, by automatic clients known as crawlers.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."
   This document may not be modified, and derivative works of it may
   not be created, except to format it for publication as an RFC or to
   translate it into languages other than English.

   This Internet-Draft will expire on December 2, 2021.

Copyright Notice

   Copyright (c) 2021 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Terminology
   2.  Specification
     2.1.  Protocol definition
     2.2.  Formal syntax
       2.2.1.  The user-agent line
       2.2.2.  The Allow and Disallow lines
       2.2.3.  Special characters
       2.2.4.  Other records
     2.3.  Access method
       2.3.1.  Access results
     2.4.  Caching
     2.5.  Limits
     2.6.  Security Considerations
     2.7.  IANA Considerations
   3.  Examples
     3.1.  Simple example
     3.2.  Longest Match
   4.  References
     4.1.  Normative References
     4.2.  URIs
   Authors' Addresses

1.  Introduction

   This document applies to services that provide resources that
   clients can access through URIs as defined in RFC3986 [2].  For
   example, in the context of HTTP, a browser is a client that displays
   the content of a web page.

   Crawlers are automated clients.  Search engines, for instance, have
   crawlers that recursively traverse links for indexing, as defined in
   RFC8288 [3].

   It may be inconvenient for service owners if crawlers visit the
   entirety of their URI space.  This document specifies the rules that
   crawlers are expected to obey when accessing URIs.

   These rules are not a form of access authorization.

1.1.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

2.  Specification
2.1.  Protocol definition

   The protocol language consists of rule(s) and group(s) that the
   service makes available in a file named "robots.txt" as described in
   section 2.3:

   o  *Rule*: A line with a key-value pair that defines how a crawler
      may access URIs.  See section 2.2.2.

   o  *Group*: One or more user-agent lines followed by one or more
      rules.  A group is terminated by a user-agent line or by the end
      of the file.  See section 2.2.1.  The last group may have no
      rules, which means it implicitly allows everything.

2.2.  Formal syntax

   Below is an Augmented Backus-Naur Form (ABNF) description, as
   defined in RFC5234 [4].

     robotstxt = *(group / emptyline)
     group = startgroupline                ; We start with a user-agent
             *(startgroupline / emptyline) ; ... and possibly more
                                           ; user-agents
             *(rule / emptyline)           ; followed by rules relevant
                                           ; for UAs

     startgroupline = *WS "user-agent" *WS ":" *WS product-token EOL

     rule = *WS ("allow" / "disallow") *WS ":"
            *WS (path-pattern / empty-pattern) EOL

     ; parser implementors: add additional lines you need (for
     ; example Sitemaps), and be lenient when reading lines that don't
     ; conform.  Apply Postel's law.

     product-token = identifier / "*"
     path-pattern = "/" *UTF8-char-noctl ; valid URI path pattern
     empty-pattern = *WS

     identifier = 1*(%x2D / %x41-5A / %x5F / %x61-7A)
     comment = "#" *(UTF8-char-noctl / WS / "#")
     emptyline = EOL
     EOL = *WS [comment] NL ; end-of-line may have
                            ; optional trailing comment
     NL = %x0D / %x0A / %x0D.0A
     WS = %x20 / %x09

     ; UTF8 derived from RFC3629, but excluding control characters

     UTF8-char-noctl = UTF8-1-noctl / UTF8-2 / UTF8-3 / UTF8-4
     UTF8-1-noctl = %x21 / %x22 / %x24-7F ; excluding control, space, '#'
     UTF8-2 = %xC2-DF UTF8-tail
     UTF8-3 = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2UTF8-tail /
              %xED %x80-9F UTF8-tail / %xEE-EF 2UTF8-tail
     UTF8-4 = %xF0 %x90-BF 2UTF8-tail / %xF1-F3 3UTF8-tail /
              %xF4 %x80-8F 2UTF8-tail

     UTF8-tail = %x80-BF

2.2.1.  The user-agent line

   Crawlers set a product token to find relevant groups.  The product
   token MUST contain only "a-zA-Z_-" characters.  The product token
   SHOULD be part of the identification string that the crawler sends
   to the service (for example, in the case of HTTP, the product token
   SHOULD appear in the user-agent header).  The identification string
   SHOULD describe the purpose of the crawler.  Here's an example of an
   HTTP header with a link pointing to a page describing the purpose of
   the ExampleBot crawler, which appears both in the HTTP header and as
   a product token:

   +-------------------------------------------------+-----------------+
   | HTTP header                                     | robots.txt      |
   |                                                 | user-agent line |
   +-------------------------------------------------+-----------------+
   | user-agent: Mozilla/5.0 (compatible;            | user-agent:     |
   | ExampleBot/0.1;                                 | ExampleBot      |
   | https://www.example.com/bot.html)               |                 |
   +-------------------------------------------------+-----------------+

   Crawlers MUST find the group that matches the product token exactly,
   and then obey the rules of the group.  If there is more than one
   group matching the user-agent, the matching groups' rules MUST be
   combined into one group.  The matching MUST be case-insensitive.  If
   no matching group exists, crawlers MUST obey the first group with a
   user-agent line with a "*" value, if present.  If no group satisfies
   either condition, or no groups are present at all, no rules apply.
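   The group-selection logic above can be illustrated with a short,
   non-normative sketch.  Python is used for illustration only, and the
   data layout (a list of (user-agent-tokens, rules) pairs in file
   order, produced by some prior parsing step) is an assumption of this
   sketch, not part of the protocol:

     def select_rules(groups, product_token):
         """Return the combined list of rules the crawler must obey.

         groups: list of (user_agent_tokens, rules) pairs, in file
         order, as produced by a hypothetical robots.txt parser."""
         token = product_token.lower()   # matching is case-insensitive
         exact = []
         wildcard = None
         for user_agents, rules in groups:
             agents = [ua.lower() for ua in user_agents]
             if token in agents:
                 exact.extend(rules)     # combine all matching groups
             elif "*" in agents and wildcard is None:
                 wildcard = rules        # first "*" group is the fallback
         if exact:
             return exact
         if wildcard is not None:
             return wildcard
         return []                       # no rules apply

     groups = [
         (["foobot"], [("disallow", "/example/")]),
         (["barbot", "bazbot"], [("allow", "/example/page.html")]),
         (["*"], [("disallow", "/private/")]),
     ]
     assert select_rules(groups, "FooBot") == [("disallow", "/example/")]
     assert select_rules(groups, "quxbot") == [("disallow", "/private/")]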
2.2.2.  The Allow and Disallow lines

   These lines indicate whether accessing a URI that matches the
   corresponding path is allowed or disallowed.

   To evaluate whether access to a URI is allowed, a crawler MUST match
   the paths in allow and disallow rules against the URI.  The matching
   SHOULD be case-sensitive.  The most specific match found MUST be
   used.  The most specific match is the match that has the most
   octets.  If an allow rule and a disallow rule are equivalent, the
   allow rule SHOULD be used.  If no match is found amongst the rules
   in a group for a matching user-agent, or there are no rules in the
   group, the URI is allowed.  The /robots.txt URI is implicitly
   allowed.

   Octets in the URI and robots.txt paths outside the range of the US-
   ASCII coded character set, and those in the reserved range defined
   by RFC3986 [2], MUST be percent-encoded as defined by RFC3986 [2]
   prior to comparison.

   If a percent-encoded US-ASCII octet is encountered in the URI, it
   MUST be unencoded prior to comparison, unless it is a reserved
   character in the URI as defined by RFC3986 [2] or the character is
   outside the unreserved character range.  The match evaluates
   positively if and only if the end of the path from the rule is
   reached before a difference in octets is encountered.

   For example:

   +--------------------+-----------------------+-----------------------+
   | Path               | Encoded Path          | Path to match         |
   +--------------------+-----------------------+-----------------------+
   | /foo/bar?baz=quz   | /foo/bar?baz=quz      | /foo/bar?baz=quz      |
   |                    |                       |                       |
   | /foo/bar?baz=http  | /foo/bar?baz=http%3A% | /foo/bar?baz=http%3A% |
   | ://foo.bar         | 2F%2Ffoo.bar          | 2F%2Ffoo.bar          |
   |                    |                       |                       |
   | /foo/bar/U+E38384  | /foo/bar/%E3%83%84    | /foo/bar/%E3%83%84    |
   |                    |                       |                       |
   | /foo/bar/%E3%83%84 | /foo/bar/%E3%83%84    | /foo/bar/%E3%83%84    |
   |                    |                       |                       |
   | /foo/bar/%62%61%7A | /foo/bar/%62%61%7A    | /foo/bar/baz          |
   +--------------------+-----------------------+-----------------------+

   The crawler SHOULD ignore "disallow" and "allow" rules that are not
   in any group (for example, any rule that precedes the first user-
   agent line).

   Implementers MAY bridge encoding mismatches if they detect that the
   robots.txt file is not UTF-8 encoded.

2.2.3.  Special characters

   Crawlers SHOULD allow the following special characters:

   +-----------+--------------------------------+----------------------+
   | Character | Description                    | Example              |
   +-----------+--------------------------------+----------------------+
   | "#"       | Designates an end-of-line      | "allow: / # comment  |
   |           | comment.                       | in line"             |
   |           |                                |                      |
   |           |                                | "# comment at the    |
   |           |                                | end"                 |
   |           |                                |                      |
   | "$"       | Designates the end of the      | "allow:              |
   |           | match pattern.  The URI MUST   | /this/path/exactly$" |
   |           | end there for the rule to      |                      |
   |           | match.                         |                      |
   |           |                                |                      |
   | "*"       | Designates 0 or more instances | "allow:              |
   |           | of any character.              | /this/*/exactly"     |
   +-----------+--------------------------------+----------------------+
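   The rule evaluation described in sections 2.2.2 and 2.2.3 can be
   summarized with another non-normative sketch.  It assumes the URI
   path has already been percent-encoding-normalized as described in
   section 2.2.2, and the function names are this sketch's own:

     import re

     def pattern_to_regex(pattern):
         # A trailing "$" anchors the pattern to the end of the URI;
         # "*" matches zero or more instances of any character.
         anchored = pattern.endswith("$")
         if anchored:
             pattern = pattern[:-1]
         parts = [re.escape(p) for p in pattern.split("*")]
         return re.compile("^" + ".*".join(parts) +
                           ("$" if anchored else ""))

     def is_allowed(rules, path):
         """rules: list of ("allow" | "disallow", pattern) pairs for
         the matching group; path: the normalized URI path."""
         if path == "/robots.txt":
             return True                 # implicitly allowed
         best_len = -1
         verdict = True                  # no match at all: allowed
         for rule_type, pattern in rules:
             if pattern_to_regex(pattern).match(path):
                 # The most specific (longest) match wins; an allow
                 # rule wins over an equivalent disallow rule.
                 if len(pattern) > best_len or (
                         len(pattern) == best_len and
                         rule_type == "allow"):
                     best_len = len(pattern)
                     verdict = rule_type == "allow"
         return verdict

     rules = [("allow", "/example/page/"),
              ("disallow", "/example/page/disallowed.gif")]
     assert is_allowed(rules, "/example/page/index.html")
     assert not is_allowed(rules, "/example/page/disallowed.gif")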
   If crawlers match special characters verbatim in the URI, crawlers
   SHOULD use "%" encoding.  For example:

   +------------------------+------------------------------------------+
   | Pattern                | URI                                      |
   +------------------------+------------------------------------------+
   | /path/file-            | https://www.example.com/path/file-       |
   | with-a-%2A.html        | with-a-*.html                            |
   |                        |                                          |
   | /path/foo-%24          | https://www.example.com/path/foo-$       |
   +------------------------+------------------------------------------+

2.2.4.  Other records

   Clients MAY interpret other records that are not part of the
   robots.txt protocol; for example, 'sitemap' [5].

2.3.  Access method

   The rules MUST be accessible in a file named "/robots.txt" (all
   lower case) in the top level path of the service.  The file MUST be
   UTF-8 encoded (as defined in RFC3629 [6]) and Internet Media Type
   "text/plain" (as defined in RFC2046 [7]).

   As per RFC3986 [2], the URI of the robots.txt is:

     "scheme:[//authority]/robots.txt"

   For example, in the context of HTTP or FTP, the URI is:

     http://www.example.com/robots.txt
     https://www.example.com/robots.txt

     ftp://ftp.example.com/robots.txt

2.3.1.  Access results

2.3.1.1.  Successful access

   If the crawler successfully downloads the robots.txt, the crawler
   MUST follow the parseable rules.

2.3.1.2.  Redirects

   The server may respond to a robots.txt fetch request with a
   redirect, such as HTTP 301 and HTTP 302.  Crawlers SHOULD follow at
   least five consecutive redirects, even across authorities (for
   example, hosts in the case of HTTP), as defined in RFC1945 [8].

   If a robots.txt file is reached within five consecutive redirects,
   the robots.txt file MUST be fetched, parsed, and its rules followed
   in the context of the initial authority.

   If there are more than five consecutive redirects, crawlers MAY
   assume that the robots.txt is unavailable.

2.3.1.3.  Unavailable status

   "Unavailable" means the crawler tried to fetch the robots.txt and
   the server responded with an unavailable status code.  For example,
   in the context of HTTP, unavailable status codes are in the 400-499
   range.

   If a server status code indicates that the robots.txt file is
   unavailable to the client, then crawlers MAY access any resources
   on the server or MAY use a cached version of a robots.txt file for
   up to 24 hours.

2.3.1.4.  Unreachable status

   If the robots.txt is unreachable due to server or network errors,
   this means the robots.txt is undefined and the crawler MUST assume
   complete disallow.  For example, in the context of HTTP, an
   unreachable robots.txt has a response code in the 500-599 range.
   For other undefined status codes, the crawler MUST assume the
   robots.txt is unreachable.

   If the robots.txt is undefined for a reasonably long period of time
   (for example, 30 days), clients MAY assume the robots.txt is
   unavailable or continue to use a cached copy.

2.3.1.5.  Parsing errors

   Crawlers SHOULD try to parse each line of the robots.txt file.
   Crawlers MUST use the parseable rules.

2.4.  Caching

   Crawlers MAY cache the fetched robots.txt file's contents.
   Crawlers MAY use standard cache control as defined in RFC2616 [9].
   Crawlers SHOULD NOT use the cached version for more than 24 hours,
   unless the robots.txt is unreachable.

2.5.  Limits

   Crawlers MAY impose a parsing limit that MUST be at least 500
   kibibytes (KiB).
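   The access results in section 2.3.1, together with the limits in
   sections 2.4 and 2.5, suggest fetch logic along the lines of the
   following non-normative sketch.  The helper http_get and the
   "allow-all" / "disallow-all" / "parse" return values are
   conventions of this sketch only:

     MAX_REDIRECTS = 5             # section 2.3.1.2
     MAX_PARSE_BYTES = 500 * 1024  # parse at least 500 KiB (2.5)

     def fetch_robots_txt(url, http_get):
         """http_get(url) -> (status, location, body) is a stand-in
         for a real HTTP client."""
         redirects = 0
         while True:
             status, location, body = http_get(url)
             if status in (301, 302) and location:
                 redirects += 1
                 if redirects > MAX_REDIRECTS:
                     return "allow-all"  # MAY treat as unavailable
                 url = location          # follow, even across
                 continue                # authorities
             if 200 <= status <= 299:
                 return ("parse", body[:MAX_PARSE_BYTES])
             if 400 <= status <= 499:
                 return "allow-all"      # unavailable (2.3.1.3)
             return "disallow-all"       # unreachable or undefined
                                         # status (2.3.1.4)

     def fake_http_get(url):
         # Stand-in client that always returns a parseable file.
         return 200, None, "user-agent: *\ndisallow: /private/\n"

     print(fetch_robots_txt("https://www.example.com/robots.txt",
                            fake_http_get))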
2.6.  Security Considerations

   The Robots Exclusion Protocol is not a substitute for valid content
   security measures.  Listing URIs in the robots.txt file exposes
   them publicly and thus makes them discoverable.

2.7.  IANA Considerations

   This document has no actions for IANA.

3.  Examples

3.1.  Simple example

   The following example shows:

   o  *foobot*: A regular case.  A single user-agent token followed by
      rules.

   o  *barbot and bazbot*: A group that's relevant for more than one
      user-agent.

   o  *quxbot*: An empty group at the end of the file.

     User-Agent : foobot
     Disallow : /example/page.html
     Disallow : /example/disallowed.gif

     User-Agent : barbot
     User-Agent : bazbot
     Allow : /example/page.html
     Disallow : /example/disallowed.gif

     User-Agent: quxbot

     EOF

3.2.  Longest Match

   The following example shows that in the case of two matching rules,
   the longest one is used.  In the following case, the rule with the
   path /example/page/disallowed.gif MUST be used for the URI
   example.com/example/page/disallowed.gif, because it is the most
   specific match.

     User-Agent : foobot
     Allow : /example/page/
     Disallow : /example/page/disallowed.gif

4.  References

4.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in
              RFC 2119 Key Words", BCP 14, RFC 8174, May 2017.

4.2.  URIs

   [1] http://www.robotstxt.org/

   [2] https://tools.ietf.org/html/rfc3986

   [3] https://tools.ietf.org/html/rfc8288

   [4] https://tools.ietf.org/html/rfc5234

   [5] https://www.sitemaps.org/index.html

   [6] https://tools.ietf.org/html/rfc3629

   [7] https://tools.ietf.org/html/rfc2046

   [8] https://tools.ietf.org/html/rfc1945

   [9] https://tools.ietf.org/html/rfc2616

Authors' Addresses

   Martijn Koster
   Stalworthy Manor Farm
   Suton Lane, NR18 9JG
   Wymondham, Norfolk
   United Kingdom

   Email: m.koster@greenhills.co.uk

   Gary Illyes
   Brandschenkestrasse 110
   8002, Zurich
   Switzerland

   Email: garyillyes@google.com

   Henner Zeller
   1600 Amphitheatre Pkwy
   Mountain View, CA 94043
   USA

   Email: henner@google.com

   Lizzi Harvey
   1600 Amphitheatre Pkwy
   Mountain View, CA 94043
   USA

   Email: lizzi@google.com