| < draft-koster-rep-06.txt | draft-koster-rep-07.txt > | |||
|---|---|---|---|---|
| Network Working Group M. Koster | Network Working Group M. Koster, Ed. | |||
| Internet-Draft Stalworthy Computing, Ltd. | Internet-Draft Stalworthy Computing, Ltd. | |||
| Intended status: Informational G. Illyes | Intended status: Informational G. Illyes, Ed. | |||
| Expires: May 6, 2022 H. Zeller | Expires: 6 November 2022 H. Zeller, Ed. | |||
| L. Harvey | L. Sassman, Ed. | |||
| Google LLC. | ||||
| November 07, 2021 | 5 May 2022 | |||
| Robots Exclusion Protocol | Robots Exclusion Protocol | |||
| draft-koster-rep-06 | draft-koster-rep-07 | |||
| Abstract | Abstract | |||
| This document specifies and extends the "Robots Exclusion Protocol" | This document specifies and extends the "Robots Exclusion Protocol" | |||
| method originally defined by Martijn Koster in 1996 for service | method originally defined by Martijn Koster in 1996 for service | |||
| owners to control how content served by their services may be | owners to control how content served by their services may be | |||
| accessed, if at all, by automatic clients known as crawlers. | accessed, if at all, by automatic clients known as crawlers. | |||
| Status of This Memo | Status of This Memo | |||
| This Internet-Draft is submitted in full conformance with the | This Internet-Draft is submitted in full conformance with the | |||
| provisions of BCP 78 and BCP 79. | provisions of BCP 78 and BCP 79. | |||
| Internet-Drafts are working documents of the Internet Engineering | Internet-Drafts are working documents of the Internet Engineering | |||
| Task Force (IETF). Note that other groups may also distribute | Task Force (IETF). Note that other groups may also distribute | |||
| working documents as Internet-Drafts. The list of current Internet- | working documents as Internet-Drafts. The list of current Internet- | |||
| Drafts is at http://datatracker.ietf.org/drafts/current/. | Drafts is at https://datatracker.ietf.org/drafts/current/. | |||
| Internet-Drafts are draft documents valid for a maximum of six months | Internet-Drafts are draft documents valid for a maximum of six months | |||
| and may be updated, replaced, or obsoleted by other documents at any | and may be updated, replaced, or obsoleted by other documents at any | |||
| time. It is inappropriate to use Internet-Drafts as reference | time. It is inappropriate to use Internet-Drafts as reference | |||
| material or to cite them other than as "work in progress." | material or to cite them other than as "work in progress." | |||
| This document may not be modified, and derivative works of it may not | ||||
| be created, except to format it for publication as an RFC or to | ||||
| translate it into languages other than English. | ||||
| This Internet-Draft will expire on May 6, 2022. | This Internet-Draft will expire on 6 November 2022. | |||
| Copyright Notice | Copyright Notice | |||
| Copyright (c) 2020 IETF Trust and the persons identified as the | Copyright (c) 2022 IETF Trust and the persons identified as the | |||
| document authors. All rights reserved. | document authors. All rights reserved. | |||
| This document is subject to BCP 78 and the IETF Trust's Legal | This document is subject to BCP 78 and the IETF Trust's Legal | |||
| Provisions Relating to IETF Documents | Provisions Relating to IETF Documents (https://trustee.ietf.org/ | |||
| (http://trustee.ietf.org/license-info) in effect on the date of | license-info) in effect on the date of publication of this document. | |||
| publication of this document. Please review these documents | Please review these documents carefully, as they describe your rights | |||
| carefully, as they describe your rights and restrictions with respect | and restrictions with respect to this document. Code Components | |||
| to this document. Code Components extracted from this document must | extracted from this document must include Revised BSD License text as | |||
| include Simplified BSD License text as described in Section 4.e of | described in Section 4.e of the Trust Legal Provisions and are | |||
| the Trust Legal Provisions and are provided without warranty as | provided without warranty as described in the Revised BSD License. | |||
| described in the Simplified BSD License. | ||||
| Table of Contents | Table of Contents | |||
| 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 | 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 | |||
| 1.1. Terminology . . . . . . . . . . . . . . . . . . . . . . . 2 | 1.1. Requirements Language . . . . . . . . . . . . . . . . . . 3 | |||
| 2. Specification . . . . . . . . . . . . . . . . . . . . . . . . 3 | 2. Specification . . . . . . . . . . . . . . . . . . . . . . . . 3 | |||
| 2.1. Protocol definition . . . . . . . . . . . . . . . . . . . 3 | 2.1. Protocol Definition . . . . . . . . . . . . . . . . . . . 3 | |||
| 2.2. Formal syntax . . . . . . . . . . . . . . . . . . . . . . 3 | 2.2. Formal Syntax . . . . . . . . . . . . . . . . . . . . . . 3 | |||
| 2.2.1. The user-agent line . . . . . . . . . . . . . . . . . 4 | 2.2.1. The User-Agent Line . . . . . . . . . . . . . . . . . 5 | |||
| 2.2.2. The Allow and Disallow lines . . . . . . . . . . . . 4 | 2.2.2. The Allow and Disallow Lines . . . . . . . . . . . . 5 | |||
| 2.2.3. Special characters . . . . . . . . . . . . . . . . . 5 | 2.2.3. Special Characters . . . . . . . . . . . . . . . . . 6 | |||
| 2.2.4. Other records . . . . . . . . . . . . . . . . . . . . 6 | 2.2.4. Other Records . . . . . . . . . . . . . . . . . . . . 7 | |||
| 2.3. Access method . . . . . . . . . . . . . . . . . . . . . . 6 | 2.3. Access Method . . . . . . . . . . . . . . . . . . . . . . 7 | |||
| 2.3.1. Access results . . . . . . . . . . . . . . . . . . . 7 | 2.3.1. Access Results . . . . . . . . . . . . . . . . . . . 8 | |||
| | 2.3.1.1. Successful Access . . . . . . . . . . . . . . . . 8 | |||
| | 2.3.1.2. Redirects . . . . . . . . . . . . . . . . . . . . 8 | |||
| | 2.3.1.3. Unavailable Status . . . . . . . . . . . . . . . 8 | |||
| | 2.3.1.4. Unreachable Status . . . . . . . . . . . . . . . 9 | |||
| | 2.3.1.5. Parsing Errors . . . . . . . . . . . . . . . . . 9 | |||
| 2.4. Caching . . . . . . . . . . . . . . . . . . . . . . . . . 8 | 2.4. Caching . . . . . . . . . . . . . . . . . . . . . . . . . 9 | |||
| 2.5. Limits . . . . . . . . . . . . . . . . . . . . . . . . . 8 | 2.5. Limits . . . . . . . . . . . . . . . . . . . . . . . . . 9 | |||
| 2.6. Security Considerations . . . . . . . . . . . . . . . . . 8 | 3. Security Considerations . . . . . . . . . . . . . . . . . . . 9 | |||
| 2.7. IANA Considerations . . . . . . . . . . . . . . . . . . . 8 | 4. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 9 | |||
| 3. Examples . . . . . . . . . . . . . . . . . . . . . . . . . . 8 | 5. Examples . . . . . . . . . . . . . . . . . . . . . . . . . . 9 | |||
| 3.1. Simple example . . . . . . . . . . . . . . . . . . . . . 8 | 5.1. Simple Example . . . . . . . . . . . . . . . . . . . . . 9 | |||
| 3.2. Longest Match . . . . . . . . . . . . . . . . . . . . . . 9 | 5.2. Longest Match . . . . . . . . . . . . . . . . . . . . . . 10 | |||
| 4. References . . . . . . . . . . . . . . . . . . . . . . . . . 9 | 6. References . . . . . . . . . . . . . . . . . . . . . . . . . 10 | |||
| 4.1. Normative References . . . . . . . . . . . . . . . . . . 9 | 6.1. Normative References . . . . . . . . . . . . . . . . . . 10 | |||
| 4.2. Informative References. . . . . . . . . . . . . . . . . . 9 | 6.2. Informative References . . . . . . . . . . . . . . . . . 11 | |||
| Author's Address . . . . . . . . . . . . . . . . . . . . . . . . 10 | Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 11 | |||
| 1. Introduction | 1. Introduction | |||
| This document applies to services that provide resources that clients | This document applies to services that provide resources that clients | |||
| can access through URIs as defined in [RFC3986]. For example, in the | can access through URIs as defined in [RFC3986]. For example, in the | |||
| context of HTTP, a browser is a client that displays the content of a | context of HTTP, a browser is a client that displays the content of a | |||
| web page. | web page. | |||
| Crawlers are automated clients. Search engines for instance have | Crawlers are automated clients. Search engines for instance have | |||
| crawlers to recursively traverse links for indexing as defined in | crawlers to recursively traverse links for indexing as defined in | |||
| [RFC8288]. | [RFC8288]. | |||
| It may be inconvenient for service owners if crawlers visit the | It may be inconvenient for service owners if crawlers visit the | |||
| entirety of their URI space. This document specifies the rules | entirety of their URI space. This document specifies the rules | |||
| originally defined by the "Robots Exclusion Protocol" [1] that | originally defined by the "Robots Exclusion Protocol" [ROBOTSTXT] | |||
| crawlers are expected to obey when accessing URIs. | that crawlers are expected to obey when accessing URIs. | |||
| These rules are not a form of access authorization. | These rules are not a form of access authorization. | |||
| 1.1. Terminology | 1.1. Requirements Language | |||
| The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", | The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", | |||
| "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and | "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and | |||
| "OPTIONAL" in this document are to be interpreted as described in | "OPTIONAL" in this document are to be interpreted as described in BCP | |||
| BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all | 14 [RFC2119] [RFC8174] when, and only when, they appear in all | |||
| capitals, as shown here. | capitals, as shown here. | |||
| 2. Specification | 2. Specification | |||
| 2.1. Protocol definition | 2.1. Protocol Definition | |||
| The protocol language consists of rule(s) and group(s) that the | The protocol language consists of rule(s) and group(s) that the | |||
| service makes available in a file named 'robots.txt' as described in | service makes available in a file named 'robots.txt' as described in | |||
| section 2.3: | section 2.3: | |||
| o *Rule*: A line with a key-value pair that defines how a crawler | * Rule: A line with a key-value pair that defines how a crawler may | |||
| may access URIs. See section 2.2.2. | access URIs. See section 2.2.2. | |||
| o *Group*: One or more user-agent lines that is followed by one or | * Group: One or more user-agent lines that is followed by one or | |||
| more rules. The group is terminated by a user-agent line or end | more rules. The group is terminated by a user-agent line or end | |||
| of file. See 2.2.1. The last group may have no rules, which means | of file. See section 2.2.1. The last group may have no rules, | |||
| it implicitly allows everything. | which means it implicitly allows everything. | |||
| 2.2. Formal syntax | 2.2. Formal Syntax | |||
| Below is an Augmented Backus-Naur Form (ABNF) description, as | Below is an Augmented Backus-Naur Form (ABNF) description, as | |||
| described in [RFC5234]. | described in [RFC5234]. | |||
| robotstxt = *(group / emptyline) | robotstxt = *(group / emptyline) | |||
| group = startgroupline ; We start with a user-agent | group = startgroupline ; We start with a user-agent | |||
| *(startgroupline / emptyline) ; ... and possibly more | *(startgroupline / emptyline) ; ... and possibly more | |||
| ; user-agents | ; user-agents | |||
| *(rule / emptyline) ; followed by rules relevant | *(rule / emptyline) ; followed by rules relevant | |||
| ; for UAs | ; for UAs | |||
| startgroupline = *WS "user-agent" *WS ":" *WS product-token EOL | startgroupline = *WS "user-agent" *WS ":" *WS product-token EOL | |||
| rule = *WS ("allow" / "disallow") *WS ":" | rule = *WS ("allow" / "disallow") *WS ":" | |||
| *WS (path-pattern / empty-pattern) EOL | *WS (path-pattern / empty-pattern) EOL | |||
| ; parser implementors: add additional lines you need (for | ; parser implementors: add additional lines you need (for | |||
| ; example Sitemaps), and be lenient when reading lines that don't | ; example, sitemaps), and be lenient when reading lines that don't | |||
| ; conform. Apply Postel's law. | ; conform. Apply Postel's law. | |||
| product-token = identifier / "*" | product-token = identifier / "*" | |||
| path-pattern = "/" *UTF8-char-noctl ; valid URI path pattern | path-pattern = "/" *UTF8-char-noctl ; valid URI path pattern | |||
| empty-pattern = *WS | empty-pattern = *WS | |||
| identifier = 1*(%x2D / %x41-5A / %x5F / %x61-7A) | identifier = 1*(%x2D / %x41-5A / %x5F / %x61-7A) | |||
| comment = "#" *(UTF8-char-noctl / WS / "#") | comment = "#" *(UTF8-char-noctl / WS / "#") | |||
| emptyline = EOL | emptyline = EOL | |||
| EOL = *WS [comment] NL ; end-of-line may have | EOL = *WS [comment] NL ; end-of-line may have | |||
| ; optional trailing comment | ; optional trailing comment | |||
| NL = %x0D / %x0A / %x0D.0A | NL = %x0D / %x0A / %x0D.0A | |||
| WS = %x20 / %x09 | WS = %x20 / %x09 | |||
| ; UTF8 derived from [RFC3629], but excluding control characters | ; UTF8 derived from RFC3629, but excluding control characters | |||
| UTF8-char-noctl = UTF8-1-noctl / UTF8-2 / UTF8-3 / UTF8-4 | UTF8-char-noctl = UTF8-1-noctl / UTF8-2 / UTF8-3 / UTF8-4 | |||
| UTF8-1-noctl = %x21 / %x22 / %x24-7F ; excluding control, space, '#' | UTF8-1-noctl = %x21 / %x22 / %x24-7F ; excluding control, space, '#' | |||
| UTF8-2 = %xC2-DF UTF8-tail | UTF8-2 = %xC2-DF UTF8-tail | |||
| UTF8-3 = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2UTF8-tail / | UTF8-3 = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2UTF8-tail / | |||
| %xED %x80-9F UTF8-tail / %xEE-EF 2UTF8-tail | %xED %x80-9F UTF8-tail / %xEE-EF 2UTF8-tail | |||
| UTF8-4 = %xF0 %x90-BF 2UTF8-tail / %xF1-F3 3UTF8-tail / | UTF8-4 = %xF0 %x90-BF 2UTF8-tail / %xF1-F3 3UTF8-tail / | |||
| %xF4 %x80-8F 2UTF8-tail | %xF4 %x80-8F 2UTF8-tail | |||
| UTF8-tail = %x80-BF | UTF8-tail = %x80-BF | |||
| 2.2.1. The user-agent line | 2.2.1. The User-Agent Line | |||
| Crawlers set a product token to find relevant groups. The product | Crawlers set a product token to find relevant groups. The product | |||
| token MUST contain only "a-zA-Z_-" characters. The product token | token MUST contain only "a-zA-Z_-" characters. The product token | |||
| SHOULD be part of the identification string that the crawler sends | SHOULD be part of the identification string that the crawler sends to | |||
| to the service (for example, in the case of HTTP, the product name | the service (for example, in the case of HTTP, the product name | |||
| SHOULD be in the user-agent header). The identification string | SHOULD be in the user-agent header). The identification string | |||
| SHOULD describe the purpose of the crawler. Here's an example of an | SHOULD describe the purpose of the crawler. Here's an example of an | |||
| HTTP header with a link pointing to a page describing the purpose of | HTTP header with a link pointing to a page describing the purpose of | |||
| the ExampleBot crawler which appears both in the HTTP header and as a | the ExampleBot crawler which appears both in the HTTP header and as a | |||
| product token: | product token: | |||
| +-------------------------------------------------+-----------------+ | +===================================+=================+ | |||
| | HTTP header | robots.txt | | | HTTP header | robots.txt | | |||
| | | user-agent line | | | | user-agent line | | |||
| +-------------------------------------------------+-----------------+ | +===================================+=================+ | |||
| | user-agent: Mozilla/5.0 (compatible; | user-agent: | | | user-agent: Mozilla/5.0 | user-agent: | | |||
| | ExampleBot/0.1; | ExampleBot | | | (compatible; ExampleBot/0.1; | ExampleBot | | |||
| | https://www.example.com/bot.html) | | | | https://www.example.com/bot.html) | | | |||
| +-------------------------------------------------+-----------------+ | +-----------------------------------+-----------------+ | |||
| | Table 1: Example of a user-agent header and user- | |||
| | agent robots.txt token for ExampleBot | |||
| Crawlers MUST find the group that matches the product token exactly, | Crawlers MUST find the group that matches the product token exactly, | |||
| and then obey the rules of the group. If there is more than one | and then obey the rules of the group. If there is more than one | |||
| group matching the user-agent, the matching groups' rules MUST be | group matching the user-agent, the matching groups' rules MUST be | |||
| combined into one group. The matching MUST be case-insensitive. If | combined into one group. The matching MUST be case-insensitive. If | |||
| no matching group exists, crawlers MUST obey the first group with a | no matching group exists, crawlers MUST obey the first group with a | |||
| user-agent line with a "*" value, if present. If no group satisfies | user-agent line with a "*" value, if present. If no group satisfies | |||
| either condition, or no groups are present at all, no rules apply. | either condition, or no groups are present at all, no rules apply. | |||
| 2.2.2. The Allow and Disallow lines | 2.2.2. The Allow and Disallow Lines | |||
| These lines indicate whether accessing a URI that matches the | These lines indicate whether accessing a URI that matches the | |||
| corresponding path is allowed or disallowed. | corresponding path is allowed or disallowed. | |||
| To evaluate if access to a URI is allowed, a robot MUST match the | To evaluate if access to a URI is allowed, a robot MUST match the | |||
| paths in allow and disallow rules against the URI. The matching | paths in allow and disallow rules against the URI. The matching | |||
| SHOULD be case sensitive. The most specific match found MUST be | SHOULD be case sensitive. The most specific match found MUST be | |||
| used. The most specific match is the match that has the most octets. | used. The most specific match is the match that has the most octets. | |||
| If an allow and disallow rule is equivalent, the allow SHOULD be | If an allow and disallow rule is equivalent, the allow SHOULD be | |||
| used. If no match is found amongst the rules in a group for a | used. If no match is found amongst the rules in a group for a | |||
| skipping to change at page 5, line 22 | skipping to change at page 6, line 19 | |||
| If a percent-encoded US-ASCII octet is encountered in the URI, it | If a percent-encoded US-ASCII octet is encountered in the URI, it | |||
| MUST be unencoded prior to comparison, unless it is a reserved | MUST be unencoded prior to comparison, unless it is a reserved | |||
| character in the URI as defined by [RFC3986] or the character is | character in the URI as defined by [RFC3986] or the character is | |||
| outside the unreserved character range. The match evaluates | outside the unreserved character range. The match evaluates | |||
| positively if and only if the end of the path from the rule is | positively if and only if the end of the path from the rule is | |||
| reached before a difference in octets is encountered. | reached before a difference in octets is encountered. | |||
| For example: | For example: | |||
| +-------------------+-----------------------+-----------------------+ | +===================+======================+======================+ | |||
| | Path | Encoded Path | Path to match | | | Path | Encoded Path | Path to Match | | |||
| +-------------------+-----------------------+-----------------------+ | +===================+======================+======================+ | |||
| | /foo/bar?baz=quz | /foo/bar?baz=quz | /foo/bar?baz=quz | | | /foo/bar?baz=quz | /foo/bar?baz=quz | /foo/bar?baz=quz | | |||
| | | | | | +-------------------+----------------------+----------------------+ | |||
| | /foo/bar?baz=http | /foo/bar?baz=http%3A% | /foo/bar?baz=http%3A% | | | /foo/bar?baz=http | /foo/bar?baz=http%3A | /foo/bar?baz=http%3A | | |||
| | ://foo.bar | 2F%2Ffoo.bar | 2F%2Ffoo.bar | | | ://foo.bar | %2F%2Ffoo.bar | %2F%2Ffoo.bar | | |||
| | | | | | +-------------------+----------------------+----------------------+ | |||
| | /foo/bar/U+E38384 | /foo/bar/%E3%83%84 | /foo/bar/%E3%83%84 | | | /foo/bar/U+E38384 | /foo/bar/%E3%83%84 | /foo/bar/%E3%83%84 | | |||
| | | | | | +-------------------+----------------------+----------------------+ | |||
| | /foo/bar/%E3%83%8 | /foo/bar/%E3%83%84 | /foo/bar/%E3%83%84 | | | /foo/ | /foo/bar/%E3%83%84 | /foo/bar/%E3%83%84 | | |||
| | 4 | | | | | bar/%E3%83%84 | | | | |||
| | | | | | +-------------------+----------------------+----------------------+ | |||
| | /foo/bar/%62%61%7 | /foo/bar/%62%61%7A | /foo/bar/baz | | | /foo/ | /foo/bar/%62%61%7A | /foo/bar/baz | | |||
| | A | | | | | bar/%62%61%7A | | | | |||
| +-------------------+-----------------------+-----------------------+ | +-------------------+----------------------+----------------------+ | |||
| | Table 2: Examples of matching percent-encoded URI components | |||
| The crawler SHOULD ignore "disallow" and "allow" rules that are not | The crawler SHOULD ignore "disallow" and "allow" rules that are not | |||
| in any group (for example, any rule that precedes the first user- | in any group (for example, any rule that precedes the first user- | |||
| agent line). | agent line). | |||
| Implementers MAY bridge encoding mismatches if they detect that the | Implementers MAY bridge encoding mismatches if they detect that the | |||
| robots.txt file is not UTF8 encoded. | robots.txt file is not UTF8 encoded. | |||
| 2.2.3. Special characters | 2.2.3. Special Characters | |||
| Crawlers SHOULD allow the following special characters: | Crawlers SHOULD allow the following special characters: | |||
| +-----------+--------------------------------+----------------------+ | +===========+===================+==============================+ | |||
| | Character | Description | Example | | | Character | Description | Example | | |||
| +-----------+--------------------------------+----------------------+ | +===========+===================+==============================+ | |||
| | "#" | Designates an end of line | "allow: / # comment | | | "#" | Designates an end | "allow: / # comment in line" | | |||
| | | comment. | in line" | | | | of line comment. | | | |||
| | | | | | | | | "# comment on its own line" | | |||
| | | | "# comment at the | | +-----------+-------------------+------------------------------+ | |||
| | | | end" | | | "$" | Designates the | "allow: /this/path/exactly$" | | |||
| | | | | | | | end of the match | | | |||
| | "$" | Designates the end of the | "allow: | | | | pattern. | | | |||
| | | match pattern. A URI MUST end | /this/path/exactly$" | | +-----------+-------------------+------------------------------+ | |||
| | | with a $. | | | | "*" | Designates 0 or | "allow: /this/*/exactly" | | |||
| | | | | | | | more instances of | | | |||
| | "*" | Designates 0 or more instances | "allow: | | | | any character. | | | |||
| | | of any character. | /this/*/exactly" | | +-----------+-------------------+------------------------------+ | |||
| +-----------+--------------------------------+----------------------+ | ||||
| | Table 3: List of special characters in robots.txt files | |||
| If crawlers match special characters verbatim in the URI, crawlers | If crawlers match special characters verbatim in the URI, crawlers | |||
| SHOULD use "%" encoding. For example: | SHOULD use "%" encoding. For example: | |||
| +------------------------+------------------------------------------+ | +============================+===============================+ | |||
| | Pattern | URI | | | Percent-encoded Pattern | URI | | |||
| +------------------------+------------------------------------------+ | +============================+===============================+ | |||
| | /path/file- | https://www.example.com/path/file- | | | /path/file-with-a-%2A.html | https://www.example.com/path/ | | |||
| | with-a-%2A.html | with-a-*.html | | | | file-with-a-*.html | | |||
| | | | | +----------------------------+-------------------------------+ | |||
| | /path/foo-%24 | https://www.example.com/path/foo-$ | | | /path/foo-%24 | https://www.example.com/path/ | | |||
| +------------------------+------------------------------------------+ | | | foo-$ | | |||
| +----------------------------+-------------------------------+ | ||||
| | Table 4: Example of percent-encoding | |||
| 2.2.4. Other records | 2.2.4. Other Records | |||
| Clients MAY interpret other records that are not part of the | Clients MAY interpret other records that are not part of the | |||
| robots.txt protocol. For example, 'sitemap' [2]. Parsing of other | robots.txt protocol. For example, 'sitemap' [SITEMAPS]. Parsing of | |||
| records MUST NOT interfere with the parsing of explicitly defined | other records MUST NOT interfere with the parsing of explicitly | |||
| records in section 2. | defined records in section 2. | |||
| 2.3. Access method | 2.3. Access Method | |||
| The rules MUST be accessible in a file named "/robots.txt" (all lower | The rules MUST be accessible in a file named "/robots.txt" (all lower | |||
| case) in the top level path of the service. The file MUST be UTF-8 | case) in the top level path of the service. The file MUST be UTF-8 | |||
| encoded (as defined in [RFC3629]) and Internet Media Type "text/ | encoded (as defined in [RFC3629]) and Internet Media Type "text/ | |||
| plain" (as defined in [RFC2046]). | plain" (as defined in [RFC2046]). | |||
| As per [RFC3986], the URI of the robots.txt is: | As per [RFC3986], the URI of the robots.txt is: | |||
| "scheme:[//authority]/robots.txt" | "scheme:[//authority]/robots.txt" | |||
| For example, in the context of HTTP or FTP, the URI is: | For example, in the context of HTTP or FTP, the URI is: | |||
| http://www.example.com/robots.txt | http://www.example.com/robots.txt | |||
| https://www.example.com/robots.txt | https://www.example.com/robots.txt | |||
| ftp://ftp.example.com/robots.txt | ftp://ftp.example.com/robots.txt | |||
| 2.3.1. Access results | 2.3.1. Access Results | |||
| 2.3.1.1. Successful access | 2.3.1.1. Successful Access | |||
| If the crawler successfully downloads the robots.txt, the crawler | If the crawler successfully downloads the robots.txt, the crawler | |||
| MUST follow the parseable rules. | MUST follow the parseable rules. | |||
| 2.3.1.2. Redirects | 2.3.1.2. Redirects | |||
| The server may respond to a robots.txt fetch request with a redirect, | The server may respond to a robots.txt fetch request with a redirect, | |||
| such as HTTP 301 and HTTP 302. The crawlers SHOULD follow at least | such as HTTP 301 and HTTP 302. The crawlers SHOULD follow at least | |||
| five consecutive redirects, even across authorities (for example | five consecutive redirects, even across authorities (for example, | |||
| hosts in case of HTTP), as defined in [RFC1945]. | hosts in case of HTTP), as defined in [RFC1945]. | |||
| If a robots.txt file is reached within five consecutive redirects, | If a robots.txt file is reached within five consecutive redirects, | |||
| the robots.txt file MUST be fetched, parsed, and its rules followed | the robots.txt file MUST be fetched, parsed, and its rules followed | |||
| in the context of the initial authority. | in the context of the initial authority. | |||
| If there are more than five consecutive redirects, crawlers MAY | If there are more than five consecutive redirects, crawlers MAY | |||
| assume that the robots.txt is unavailable. | assume that the robots.txt is unavailable. | |||
| 2.3.1.3. Unavailable status | 2.3.1.3. Unavailable Status | |||
| Unavailable means the crawler tries to fetch the robots.txt, and the | Unavailable means the crawler tries to fetch the robots.txt, and the | |||
| server responds with unavailable status codes. For example, in the | server responds with unavailable status codes. For example, in the | |||
| context of HTTP, unavailable status codes are in the 400-499 range. | context of HTTP, unavailable status codes are in the 400-499 range. | |||
| If a server status code indicates that the robots.txt file is | If a server status code indicates that the robots.txt file is | |||
| unavailable to the client, then crawlers MAY access any resources on | unavailable to the client, then crawlers MAY access any resources on | |||
| the server or MAY use a cached version of a robots.txt file for up to | the server. | |||
| 24 hours. | ||||
| 2.3.1.4. Unreachable status | 2.3.1.4. Unreachable Status | |||
| If the robots.txt is unreachable due to server or network errors, | If the robots.txt is unreachable due to server or network errors, | |||
| this means the robots.txt is undefined and the crawler MUST assume | this means the robots.txt is undefined and the crawler MUST assume | |||
| complete disallow. For example, in the context of HTTP, an | complete disallow. For example, in the context of HTTP, an | |||
| unreachable robots.txt has a response code in the 500-599 range. For | unreachable robots.txt has a response code in the 500-599 range. For | |||
| other undefined status codes, the crawler MUST assume the robots.txt | other undefined status codes, the crawler MUST assume the robots.txt | |||
| is unreachable. | is unreachable. | |||
| If the robots.txt is undefined for a reasonably long period of time | If the robots.txt is undefined for a reasonably long period of time | |||
| (for example, 30 days), clients MAY assume the robots.txt is | (for example, 30 days), clients MAY assume the robots.txt is | |||
| unavailable or continue to use a cached copy. | unavailable or continue to use a cached copy. | |||
| 2.3.1.5. Parsing errors | 2.3.1.5. Parsing Errors | |||
| Crawlers SHOULD try to parse each line of the robots.txt file. | Crawlers SHOULD try to parse each line of the robots.txt file. | |||
| Crawlers MUST use the parseable rules. | Crawlers MUST use the parseable rules. | |||
| 2.4. Caching | 2.4. Caching | |||
| Crawlers MAY cache the fetched robots.txt file's contents. Crawlers | Crawlers MAY cache the fetched robots.txt file's contents. Crawlers | |||
| MAY use standard cache control as defined in [RFC2616]. Crawlers | MAY use standard cache control as defined in [RFC2616]. Crawlers | |||
| SHOULD NOT use the cached version for more than 24 hours, unless the | SHOULD NOT use the cached version for more than 24 hours, unless the | |||
| robots.txt is unreachable. | robots.txt is unreachable. | |||
| 2.5. Limits | 2.5. Limits | |||
| Crawlers MAY impose a parsing limit that MUST be at least 500 | Crawlers MAY impose a parsing limit that MUST be at least 500 | |||
| kibibytes (KiB). | kibibytes (KiB). | |||
| 2.6. Security Considerations | 3. Security Considerations | |||
| The Robots Exclusion Protocol is not a substitute for more valid | The Robots Exclusion Protocol is not a substitute for more valid | |||
| content security measures. Listing URIs in the robots.txt file | content security measures. Listing URIs in the robots.txt file | |||
| exposes the URI publicly and thus making the URIs discoverable. | exposes the URI publicly and thus makes the URIs discoverable. | |||
| 2.7. IANA Considerations. | 4. IANA Considerations | |||
| This document has no actions for IANA. | This document has no actions for IANA. | |||
| 3. Examples | 5. Examples | |||
| 3.1. Simple example | 5.1. Simple Example | |||
| The following example shows: | The following example shows: | |||
| o *foobot*: A regular case. A single user-agent token followed by | * foobot: A regular case. A single user-agent token followed by | |||
| rules. | rules. | |||
| o *barbot and bazbot*: A group that's relevant for more than one | ||||
| user-agent. | ||||
| o *quxbot:* Empty group at end of file. | ||||
| <CODE BEGINS> | * barbot and bazbot: A group that's relevant for more than one user- | |||
| User-Agent : foobot | agent. | |||
| Disallow : /example/page.html | ||||
| Disallow : /example/disallowed.gif | ||||
| User-Agent : barbot | * quxbot: An empty group at end of the file. | |||
| User-Agent : bazbot | ||||
| Allow : /example/page.html | ||||
| Disallow : /example/disallowed.gif | ||||
| User-Agent: quxbot | User-Agent : foobot | |||
| Disallow : /example/page.html | ||||
| Disallow : /example/disallowed.gif | ||||
| EOF | User-Agent : barbot | |||
| <CODE ENDS> | User-Agent : bazbot | |||
| Allow : /example/page.html | ||||
| Disallow : /example/disallowed.gif | ||||
| 3.2. Longest Match | User-Agent: quxbot | |||
| EOF | ||||
| 5.2. Longest Match | ||||
| The following example shows that in the case of two rules, the | The following example shows that in the case of two rules, the | |||
| longest one is used for matching. In the following case, | longest one is used for matching. In the following case, | |||
| /example/page/disallowed.gif MUST be used for the URI | /example/page/disallowed.gif MUST be used for the URI | |||
| example.com/example/page/disallow.gif . | example.com/example/page/disallow.gif. | |||
| <CODE BEGINS> | User-Agent : foobot | |||
| User-Agent : foobot | Allow : /example/page/ | |||
| Allow : /example/page/ | Disallow : /example/page/disallowed.gif | |||
| Disallow : /example/page/disallowed.gif | ||||
| <CODE ENDS> | ||||
| 4. References | 6. References | |||
| 4.1. Normative References | 6.1. Normative References | |||
| [RFC1945] Berners-Lee, T., Fielding, R., and H. Frystyk, | ||||
| "Hypertext Transfer Protocol -- HTTP/1.0", RFC 1945, | [RFC1945] Berners-Lee, T., Fielding, R., and H. Frystyk, "Hypertext | |||
| May 1996. | Transfer Protocol -- HTTP/1.0", RFC 1945, | |||
| [RFC2046] Freed, N., Borenstein, N., "Multipurpose Internet Mail | DOI 10.17487/RFC1945, May 1996, | |||
| <https://www.rfc-editor.org/info/rfc1945>. | ||||
| [RFC2046] Freed, N. and N. Borenstein, "Multipurpose Internet Mail | ||||
| Extensions (MIME) Part Two: Media Types", RFC 2046, | Extensions (MIME) Part Two: Media Types", RFC 2046, | |||
| November 1996. | DOI 10.17487/RFC2046, November 1996, | |||
| <https://www.rfc-editor.org/info/rfc2046>. | ||||
| [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate | [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate | |||
| Requirement Levels", BCP 14, RFC 2119, March 1997. | Requirement Levels", BCP 14, RFC 2119, | |||
| DOI 10.17487/RFC2119, March 1997, | ||||
| <https://www.rfc-editor.org/info/rfc2119>. | ||||
| [RFC2616] Fielding, R., Gettys, J., Mogul, J., Frystyk, H., | [RFC2616] Fielding, R., Gettys, J., Mogul, J., Frystyk, H., | |||
| Masinter, L., Leach, P., Berners-Lee, T., "Hypertext | Masinter, L., Leach, P., and T. Berners-Lee, "Hypertext | |||
| Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999. | Transfer Protocol -- HTTP/1.1", RFC 2616, | |||
| DOI 10.17487/RFC2616, June 1999, | ||||
| <https://www.rfc-editor.org/info/rfc2616>. | ||||
| [RFC3629] Yergeau, F., "UTF-8, a transformation format of ISO | [RFC3629] Yergeau, F., "UTF-8, a transformation format of ISO | |||
| 10646", STD 63, RFC 3629, November 2003. | 10646", STD 63, RFC 3629, DOI 10.17487/RFC3629, November | |||
| 2003, <https://www.rfc-editor.org/info/rfc3629>. | ||||
| [RFC3986] Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform | [RFC3986] Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform | |||
| Resource Identifier (URI): Generic Syntax", STD 66, | Resource Identifier (URI): Generic Syntax", STD 66, | |||
| RFC 3986, January 2005. | RFC 3986, DOI 10.17487/RFC3986, January 2005, | |||
| [RFC5234] Crocker, D., Overell, P., "Augmented BNF for Syntax | <https://www.rfc-editor.org/info/rfc3986>. | |||
| Specifications: ABNF", RFC 5234, STD 68, January 2008. | ||||
| [RFC8174] Leiba, B., "Ambiquity of Uppercase vs Lowercase in RFC | [RFC5234] Crocker, D., Ed. and P. Overell, "Augmented BNF for Syntax | |||
| 2119 Key Words", BCP 14, RFC 2119, Mat 2017. | Specifications: ABNF", STD 68, RFC 5234, | |||
| [RFC8288] Nottingham, M., "Web Linking", RFC 8288, October 2017. | DOI 10.17487/RFC5234, January 2008, | |||
| <https://www.rfc-editor.org/info/rfc5234>. | ||||
| 4.2. Informative References | [RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC | |||
| 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, | ||||
| May 2017, <https://www.rfc-editor.org/info/rfc8174>. | ||||
| [1] http://www.robotstxt.org/ | [RFC8288] Nottingham, M., "Web Linking", RFC 8288, | |||
| DOI 10.17487/RFC8288, October 2017, | ||||
| <https://www.rfc-editor.org/info/rfc8288>. | ||||
| [2] https://www.sitemaps.org/index.html | 6.2. Informative References | |||
| Authors' Address | [ROBOTSTXT] | |||
| "Robots Exclusion Protocol", n.d., | ||||
| <http://www.robotstxt.org/>. | ||||
| Martijn Koster | [SITEMAPS] "Sitemaps Protocol", n.d., | |||
| Stalworthy Manor Farm | <https://www.sitemaps.org/index.html>. | |||
| Suton Lane, NR18 9JG | ||||
| Authors' Addresses | ||||
| Martijn Koster (editor) | ||||
| Stalworthy Computing, Ltd. | ||||
| Suton Lane | ||||
| Wymondham, Norfolk | Wymondham, Norfolk | |||
| NR18 9JG | ||||
| United Kingdom | United Kingdom | |||
| Email: m.koster@greenhills.co.uk | Email: m.koster@greenhills.co.uk | |||
| Gary Illyes (editor) | ||||
| Gary Illyes | Google LLC. | |||
| Brandschenkestrasse 110 | Brandschenkestrasse 110 | |||
| 8002, Zurich | CH-8002 Zurich | |||
| Switzerland | Switzerland | |||
| Email: garyillyes@google.com | Email: garyillyes@google.com | |||
| Henner Zeller | Henner Zeller (editor) | |||
| Google LLC. | ||||
| 1600 Amphitheatre Pkwy | 1600 Amphitheatre Pkwy | |||
| Mountain View, CA 94043 | Mountain View, CA, 94043 | |||
| USA | United States of America | |||
| Email: henner@google.com | Email: henner@google.com | |||
| Lizzi Harvey | Lizzi Sassman (editor) | |||
| 1600 Amphitheatre Pkwy | Google LLC. | |||
| Mountain View, CA 94043 | Brandschenkestrasse 110 | |||
| USA | CH-8002 Zurich | |||
| Switzerland | ||||
| Email: lizzi@google.com | Email: lizzi@google.com | |||
| End of changes. 77 change blocks. | ||||
| 216 lines changed or deleted | 252 lines changed or added | |||
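
The caching and limit rules in Sections 2.4 and 2.5 of the diffed text can be sketched as follows. This is a minimal illustration, not part of either draft: the function names, the constant names, and the choice to truncate oversized input are all assumptions made for the example.

```python
# Hypothetical helper names; the drafts define no API.
MAX_CACHE_AGE_SECONDS = 24 * 60 * 60  # "SHOULD NOT use ... for more than 24 hours"
MIN_PARSE_LIMIT_BYTES = 500 * 1024    # "MUST be at least 500 kibibytes (KiB)"


def may_use_cached_copy(fetched_at, now, robots_txt_unreachable):
    """Return True if a cached robots.txt copy may still be used.

    fetched_at and now are timestamps in seconds. A copy older than
    24 hours is only kept when robots.txt is currently unreachable.
    """
    if robots_txt_unreachable:
        return True
    return (now - fetched_at) <= MAX_CACHE_AGE_SECONDS


def apply_parse_limit(body, limit=MIN_PARSE_LIMIT_BYTES):
    """Return the leading part of body (bytes) that the crawler will parse.

    Truncating is an assumption of this sketch; the text only requires
    that any limit a crawler imposes is at least 500 KiB.
    """
    if limit < MIN_PARSE_LIMIT_BYTES:
        raise ValueError("parsing limit must be at least 500 KiB")
    return body[:limit]
```

A crawler following this sketch would refetch robots.txt once the cached copy is more than a day old, and fall back to the stale copy only while the refetch fails.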
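
The "Longest Match" example in the diffed text can be illustrated with a small sketch. It assumes plain prefix matching on the URI path and ignores the special characters "*" and "$"; the function name `is_allowed` and the `(kind, pattern)` rule representation are hypothetical. Treating an unmatched path as allowed, and keeping the first rule when two matching patterns have equal length, are assumptions of the sketch rather than statements taken from the lines shown here.

```python
def is_allowed(rules, path):
    """Decide access for path using the longest matching rule.

    rules is a list of (kind, pattern) tuples for one user-agent group,
    where kind is "allow" or "disallow" and pattern is a path prefix.
    """
    best_kind = "allow"  # assumption: no matching rule means allowed
    best_len = -1
    for kind, pattern in rules:
        if not pattern:
            continue  # skip empty patterns in this sketch
        # Plain prefix match; a strictly longer pattern replaces the best one.
        if path.startswith(pattern) and len(pattern) > best_len:
            best_kind, best_len = kind, len(pattern)
    return best_kind == "allow"


# Rules from the foobot group in the Longest Match example.
foobot_rules = [
    ("allow", "/example/page/"),
    ("disallow", "/example/page/disallowed.gif"),
]

print(is_allowed(foobot_rules, "/example/page/disallowed.gif"))
print(is_allowed(foobot_rules, "/example/page/other.html"))
```

Under this prefix-matching sketch, the path /example/page/disallow.gif that appears in the prose matches only the Allow rule, because "/example/page/disallowed.gif" is not a prefix of it. The usage lines above therefore test /example/page/disallowed.gif, where both rules match and the longer Disallow rule wins.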