Network Working Group                                     M. Koster, Ed.
Internet-Draft                                Stalworthy Computing, Ltd.
Intended status: Informational                            G. Illyes, Ed.
Expires: 6 November 2022                                  H. Zeller, Ed.
                                                         L. Sassman, Ed.
                                                             Google LLC.
                                                              5 May 2022

                       Robots Exclusion Protocol
                          draft-koster-rep-07
Abstract

   This document specifies and extends the "Robots Exclusion Protocol"
   method originally defined by Martijn Koster in 1996 for service
   owners to control how content served by their services may be
   accessed, if at all, by automatic clients known as crawlers.
Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 6 November 2022.
Copyright Notice

   Copyright (c) 2022 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Revised BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Revised BSD License.
Table of Contents

   1.  Introduction
     1.1.  Requirements Language
   2.  Specification
     2.1.  Protocol Definition
     2.2.  Formal Syntax
       2.2.1.  The User-Agent Line
       2.2.2.  The Allow and Disallow Lines
       2.2.3.  Special Characters
       2.2.4.  Other Records
     2.3.  Access Method
       2.3.1.  Access Results
         2.3.1.1.  Successful Access
         2.3.1.2.  Redirects
         2.3.1.3.  Unavailable Status
         2.3.1.4.  Unreachable Status
         2.3.1.5.  Parsing Errors
     2.4.  Caching
     2.5.  Limits
   3.  Security Considerations
   4.  IANA Considerations
   5.  Examples
     5.1.  Simple Example
     5.2.  Longest Match
   6.  References
     6.1.  Normative References
     6.2.  Informative References
   Authors' Addresses
1.  Introduction

   This document applies to services that provide resources that clients
   can access through URIs as defined in [RFC3986].  For example, in the
   context of HTTP, a browser is a client that displays the content of a
   web page.

   Crawlers are automated clients.  Search engines, for instance, have
   crawlers to recursively traverse links for indexing as defined in
   [RFC8288].

   It may be inconvenient for service owners if crawlers visit the
   entirety of their URI space.  This document specifies the rules
   originally defined by the "Robots Exclusion Protocol" [ROBOTSTXT]
   that crawlers are expected to obey when accessing URIs.

   These rules are not a form of access authorization.
1.1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.
2.  Specification

2.1.  Protocol Definition

   The protocol language consists of rule(s) and group(s) that the
   service makes available in a file named 'robots.txt' as described in
   Section 2.3:

   *  Rule: A line with a key-value pair that defines how a crawler may
      access URIs.  See Section 2.2.2.

   *  Group: One or more user-agent lines, followed by one or more
      rules.  The group is terminated by a user-agent line or end of
      file.  See Section 2.2.1.  The last group may have no rules,
      which means it implicitly allows everything.
2.2.  Formal Syntax

   Below is an Augmented Backus-Naur Form (ABNF) description, as
   described in [RFC5234].

    robotstxt = *(group / emptyline)
    group = startgroupline                ; We start with a user-agent
           *(startgroupline / emptyline)  ; ... and possibly more
                                          ; user-agents
           *(rule / emptyline)            ; followed by rules relevant
                                          ; for UAs

    startgroupline = *WS "user-agent" *WS ":" *WS product-token EOL

    rule = *WS ("allow" / "disallow") *WS ":"
          *WS (path-pattern / empty-pattern) EOL

    ; parser implementors: add additional lines you need (for
    ; example, sitemaps), and be lenient when reading lines that don't
    ; conform.  Apply Postel's law.

    product-token = identifier / "*"
    path-pattern = "/" *UTF8-char-noctl ; valid URI path pattern
    empty-pattern = *WS

    identifier = 1*(%x2D / %x41-5A / %x5F / %x61-7A)
    comment = "#" *(UTF8-char-noctl / WS / "#")
    emptyline = EOL

    EOL = *WS [comment] NL ; end-of-line may have
                           ; optional trailing comment
    NL = %x0D / %x0A / %x0D.0A
    WS = %x20 / %x09

    ; UTF8 derived from RFC3629, but excluding control characters

    UTF8-char-noctl = UTF8-1-noctl / UTF8-2 / UTF8-3 / UTF8-4
    UTF8-1-noctl = %x21 / %x22 / %x24-7F ; excluding control, space, '#'
    UTF8-2 = %xC2-DF UTF8-tail
    UTF8-3 = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2UTF8-tail /
             %xED %x80-9F UTF8-tail / %xEE-EF 2UTF8-tail
    UTF8-4 = %xF0 %x90-BF 2UTF8-tail / %xF1-F3 3UTF8-tail /
             %xF4 %x80-8F 2UTF8-tail

    UTF8-tail = %x80-BF
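   As a non-normative illustration, the grammar above lends itself to a
   simple, lenient line parser.  The function name and the
   (user_agents, rules) data shape below are hypothetical choices, not
   part of the protocol:

```python
def parse_robotstxt(text: str):
    """Lenient robots.txt line parser: returns a list of groups,
    each a (user_agents, rules) pair.  Lines that don't conform are
    ignored per Postel's law; '#' starts a trailing comment."""
    groups, agents, rules = [], [], []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue                          # not a key-value line
        key, _, value = line.partition(":")
        key, value = key.strip().lower(), value.strip()
        if key == "user-agent":
            if rules:                         # user-agent after rules
                groups.append((agents, rules))  # starts a new group
                agents, rules = [], []
            agents.append(value)
        elif key in ("allow", "disallow"):
            if agents:                        # ignore rules outside groups
                rules.append((key, value))
    if agents:
        groups.append((agents, rules))        # last group may be empty
    return groups
```

   Note that a trailing group with no rules is kept, matching the
   "implicitly allows everything" case in Section 2.1.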
2.2.1.  The User-Agent Line
   Crawlers set a product token to find relevant groups.  The product
   token MUST contain only "a-zA-Z_-" characters.  The product token
   SHOULD be part of the identification string that the crawler sends
   to the service (for example, in the case of HTTP, the product name
   SHOULD be in the user-agent header).  The identification string
   SHOULD describe the purpose of the crawler.  Here's an example of an
   HTTP header with a link pointing to a page describing the purpose of
   the ExampleBot crawler, which appears both in the HTTP header and as
   a product token:
   +===================================+=================+
   | HTTP header                       | robots.txt      |
   |                                   | user-agent line |
   +===================================+=================+
   | user-agent: Mozilla/5.0           | user-agent:     |
   | (compatible; ExampleBot/0.1;      | ExampleBot      |
   | https://www.example.com/bot.html) |                 |
   +-----------------------------------+-----------------+

        Table 1: Example of a user-agent header and user-agent
                  robots.txt token for ExampleBot
   Crawlers MUST find the group that matches the product token exactly,
   and then obey the rules of the group.  If there is more than one
   group matching the user-agent, the matching groups' rules MUST be
   combined into one group.  The matching MUST be case-insensitive.  If
   no matching group exists, crawlers MUST obey the first group with a
   user-agent line with a "*" value, if present.  If no group satisfies
   either condition, or no groups are present at all, no rules apply.
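   The selection rules above can be sketched as follows.  This is a
   non-normative sketch; the function name is hypothetical, and groups
   are assumed to be represented as (user_agents, rules) pairs:

```python
def select_rules(groups, product_token):
    """Pick the rules for a crawler per Section 2.2.1: exact,
    case-insensitive product-token match; all matching groups are
    combined into one; otherwise fall back to the first '*' group;
    otherwise no rules apply."""
    token = product_token.lower()
    matching = [g for g in groups
                if token in (a.lower() for a in g[0])]
    if matching:
        # combine the rules of every matching group into one group
        return [r for _, rules in matching for r in rules]
    for agents, rules in groups:
        if "*" in agents:            # first wildcard group, if present
            return rules
    return []                        # no rules apply
```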
2.2.2.  The Allow and Disallow Lines

   These lines indicate whether accessing a URI that matches the
   corresponding path is allowed or disallowed.

   To evaluate if access to a URI is allowed, a robot MUST match the
   paths in allow and disallow rules against the URI.  The matching
   SHOULD be case sensitive.  The most specific match found MUST be
   used.  The most specific match is the match that has the most
   octets.  If an allow and a disallow rule are equivalent, the allow
   SHOULD be used.  If no match is found amongst the rules in a group
   for a matching user-agent, or there are no rules in the group, the
   URI is allowed.
   If a percent-encoded US-ASCII octet is encountered in the URI, it
   MUST be unencoded prior to comparison, unless it is a reserved
   character in the URI as defined by [RFC3986] or the character is
   outside the unreserved character range.  The match evaluates
   positively if and only if the end of the path from the rule is
   reached before a difference in octets is encountered.
   For example:

   +====================+======================+======================+
   | Path               | Encoded Path         | Path to Match        |
   +====================+======================+======================+
   | /foo/bar?baz=quz   | /foo/bar?baz=quz     | /foo/bar?baz=quz     |
   +--------------------+----------------------+----------------------+
   | /foo/bar?baz=      | /foo/bar?baz=http%3A | /foo/bar?baz=http%3A |
   | http://foo.bar     | %2F%2Ffoo.bar        | %2F%2Ffoo.bar        |
   +--------------------+----------------------+----------------------+
   | /foo/bar/U+E38384  | /foo/bar/%E3%83%84   | /foo/bar/%E3%83%84   |
   +--------------------+----------------------+----------------------+
   | /foo/bar/%E3%83%84 | /foo/bar/%E3%83%84   | /foo/bar/%E3%83%84   |
   +--------------------+----------------------+----------------------+
   | /foo/bar/%62%61%7A | /foo/bar/%62%61%7A   | /foo/bar/baz         |
   +--------------------+----------------------+----------------------+

      Table 2: Examples of matching percent-encoded URI components
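   A non-normative sketch of this normalization step follows.  The
   function name is hypothetical; it decodes only octets that map to
   unreserved characters [RFC3986] and leaves everything else
   percent-encoded, as in Table 2:

```python
import re

# unreserved characters per RFC 3986 section 2.3
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)

def normalize_path(path: str) -> str:
    """Decode percent-encoded octets that map to unreserved
    characters; keep reserved and non-ASCII octets encoded."""
    def repl(m: "re.Match[str]") -> str:
        ch = chr(int(m.group(1), 16))
        if ch in UNRESERVED:
            return ch
        # uppercase the hex digits of what stays encoded (a
        # normalization choice, not mandated by the protocol)
        return m.group(0).upper()
    return re.sub(r"%([0-9A-Fa-f]{2})", repl, path)
```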
   The crawler SHOULD ignore "disallow" and "allow" rules that are not
   in any group (for example, any rule that precedes the first user-
   agent line).

   Implementers MAY bridge encoding mismatches if they detect that the
   robots.txt file is not UTF8 encoded.
2.2.3.  Special Characters

   Crawlers SHOULD allow the following special characters:

   +===========+===================+==============================+
   | Character | Description       | Example                      |
   +===========+===================+==============================+
   | "#"       | Designates an end | "allow: / # comment in line" |
   |           | of line comment.  |                              |
   |           |                   | "# comment on its own line"  |
   +-----------+-------------------+------------------------------+
   | "$"       | Designates the    | "allow: /this/path/exactly$" |
   |           | end of the match  |                              |
   |           | pattern.          |                              |
   +-----------+-------------------+------------------------------+
   | "*"       | Designates 0 or   | "allow: /this/*/exactly"     |
   |           | more instances of |                              |
   |           | any character.    |                              |
   +-----------+-------------------+------------------------------+

       Table 3: List of special characters in robots.txt files
   If crawlers match special characters verbatim in the URI, crawlers
   SHOULD use "%" encoding.  For example:

   +============================+===============================+
   | Percent-encoded Pattern    | URI                           |
   +============================+===============================+
   | /path/file-with-a-%2A.html | https://www.example.com/path/ |
   |                            | file-with-a-*.html            |
   +----------------------------+-------------------------------+
   | /path/foo-%24              | https://www.example.com/path/ |
   |                            | foo-$                         |
   +----------------------------+-------------------------------+

              Table 4: Example of percent-encoding
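   A non-normative sketch of matching with the "*" and "$" special
   characters, by translating a path pattern into a regular expression
   (function names are hypothetical):

```python
import re

def pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a robots.txt path pattern: '*' matches zero or more
    of any character; a trailing '$' anchors the end of the match."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # escape everything except '*', which becomes '.*'
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(regex + ("$" if anchored else ""))

def pattern_matches(pattern: str, path: str) -> bool:
    # re.match anchors at the start, giving prefix semantics for
    # patterns without a trailing '$'
    return pattern_to_regex(pattern).match(path) is not None
```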
2.2.4.  Other Records
   Clients MAY interpret other records that are not part of the
   robots.txt protocol.  For example, 'sitemap' [SITEMAPS].  Parsing of
   other records MUST NOT interfere with the parsing of explicitly
   defined records in Section 2.
2.3.  Access Method

   The rules MUST be accessible in a file named "/robots.txt" (all
   lower case) in the top level path of the service.  The file MUST be
   UTF-8 encoded (as defined in [RFC3629]) and Internet Media Type
   "text/plain" (as defined in [RFC2046]).

   As per [RFC3986], the URI of the robots.txt is:

   "scheme:[//authority]/robots.txt"

   For example, in the context of HTTP or FTP, the URI is:

   http://www.example.com/robots.txt

   https://www.example.com/robots.txt

   ftp://ftp.example.com/robots.txt

2.3.1.  Access Results

2.3.1.1.  Successful Access
   If the crawler successfully downloads the robots.txt, the crawler
   MUST follow the parseable rules.
2.3.1.2.  Redirects

   The server may respond to a robots.txt fetch request with a
   redirect, such as HTTP 301 and HTTP 302.  The crawlers SHOULD follow
   at least five consecutive redirects, even across authorities (for
   example, hosts in case of HTTP), as defined in [RFC1945].

   If a robots.txt file is reached within five consecutive redirects,
   the robots.txt file MUST be fetched, parsed, and its rules followed
   in the context of the initial authority.

   If there are more than five consecutive redirects, crawlers MAY
   assume that the robots.txt is unavailable.
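   The redirect-following behavior can be sketched as below.  This is
   non-normative; the function name is hypothetical, and the fetch
   callback (returning a status code and either a Location value or a
   body) stands in for a real HTTP client:

```python
from typing import Callable, Optional, Tuple

MAX_REDIRECTS = 5  # SHOULD follow at least five consecutive redirects

def get_robots_txt(uri: str,
                   fetch: Callable[[str], Tuple[int, str]]) -> Optional[str]:
    """Fetch robots.txt, following up to MAX_REDIRECTS consecutive
    redirects (even across authorities).  Beyond that, or on any
    status not handled here, treat robots.txt as unavailable."""
    for _ in range(MAX_REDIRECTS + 1):   # initial fetch + redirects
        status, payload = fetch(uri)
        if status in (301, 302):
            uri = payload                # payload: the Location target
            continue
        if status == 200:
            return payload               # payload: the file body
        return None                      # unavailable
    return None                          # more than five redirects
```

   Note that the rules fetched this way apply in the context of the
   initial authority, not the one that finally served the file.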
2.3.1.3.  Unavailable Status

   Unavailable means the crawler tries to fetch the robots.txt, and the
   server responds with unavailable status codes.  For example, in the
   context of HTTP, unavailable status codes are in the 400-499 range.

   If a server status code indicates that the robots.txt file is
   unavailable to the client, then crawlers MAY access any resources on
   the server.
2.3.1.4.  Unreachable Status

   If the robots.txt is unreachable due to server or network errors,
   this means the robots.txt is undefined and the crawler MUST assume
   complete disallow.  For example, in the context of HTTP, an
   unreachable robots.txt has a response code in the 500-599 range.
   For other undefined status codes, the crawler MUST assume the
   robots.txt is unreachable.

   If the robots.txt is undefined for a reasonably long period of time
   (for example, 30 days), clients MAY assume the robots.txt is
   unavailable or continue to use a cached copy.
2.3.1.5.  Parsing Errors

   Crawlers SHOULD try to parse each line of the robots.txt file.
   Crawlers MUST use the parseable rules.
2.4.  Caching

   Crawlers MAY cache the fetched robots.txt file's contents.  Crawlers
   MAY use standard cache control as defined in [RFC2616].  Crawlers
   SHOULD NOT use the cached version for more than 24 hours, unless the
   robots.txt is unreachable.
2.5.  Limits

   Crawlers MAY impose a parsing limit that MUST be at least 500
   kibibytes (KiB).
3.  Security Considerations

   The Robots Exclusion Protocol is not a substitute for valid content
   security measures.  Listing URIs in the robots.txt file exposes the
   URIs publicly and thus makes them discoverable.
4.  IANA Considerations

   This document has no actions for IANA.
5.  Examples

5.1.  Simple Example

   The following example shows:

   *  foobot: A regular case.  A single user-agent token followed by
      rules.

   *  barbot and bazbot: A group that's relevant for more than one
      user-agent.

   *  quxbot: An empty group at end of the file.

   User-Agent : foobot
   Disallow : /example/page.html
   Disallow : /example/disallowed.gif

   User-Agent : barbot
   User-Agent : bazbot
   Allow : /example/page.html
   Disallow : /example/disallowed.gif

   User-Agent: quxbot

   EOF
5.2.  Longest Match

   The following example shows that in the case of two rules, the
   longest one is used for matching.  In the following case,
   /example/page/disallowed.gif MUST be used for the URI
   example.com/example/page/disallowed.gif.

   User-Agent : foobot
   Allow : /example/page/
   Disallow : /example/page/disallowed.gif
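   The longest-match rule from Section 2.2.2 can be sketched as below.
   This is non-normative; the function name is hypothetical, rules are
   assumed to be (verb, pattern) pairs, and the wildcard syntax of
   Section 2.2.3 is ignored for brevity (plain prefix matching only):

```python
def is_allowed(rules, path: str) -> bool:
    """Per Section 2.2.2: the most specific (most octets) matching
    rule wins; if an allow and a disallow rule are equivalent, allow
    wins; with no matching rule, the URI is allowed."""
    best = None  # (length in octets, verb)
    for verb, pattern in rules:
        if path.startswith(pattern):          # prefix match only
            length = len(pattern.encode("utf-8"))
            if (best is None or length > best[0]
                    or (length == best[0] and verb == "allow")):
                best = (length, verb)
    return best is None or best[1] == "allow"

rules = [("allow", "/example/page/"),
         ("disallow", "/example/page/disallowed.gif")]
```

   With the group from the example above, /example/page/disallowed.gif
   is disallowed (the disallow rule is longer), while other URIs under
   /example/page/ remain allowed.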
6.  References

6.1.  Normative References

   [RFC1945]  Berners-Lee, T., Fielding, R., and H. Frystyk, "Hypertext
              Transfer Protocol -- HTTP/1.0", RFC 1945,
              DOI 10.17487/RFC1945, May 1996,
              <https://www.rfc-editor.org/info/rfc1945>.

   [RFC2046]  Freed, N. and N. Borenstein, "Multipurpose Internet Mail
              Extensions (MIME) Part Two: Media Types", RFC 2046,
              DOI 10.17487/RFC2046, November 1996,
              <https://www.rfc-editor.org/info/rfc2046>.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC2616]  Fielding, R., Gettys, J., Mogul, J., Frystyk, H.,
              Masinter, L., Leach, P., and T. Berners-Lee, "Hypertext
              Transfer Protocol -- HTTP/1.1", RFC 2616,
              DOI 10.17487/RFC2616, June 1999,
              <https://www.rfc-editor.org/info/rfc2616>.

   [RFC3629]  Yergeau, F., "UTF-8, a transformation format of ISO
              10646", STD 63, RFC 3629, DOI 10.17487/RFC3629, November
              2003, <https://www.rfc-editor.org/info/rfc3629>.

   [RFC3986]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
              Resource Identifier (URI): Generic Syntax", STD 66,
              RFC 3986, DOI 10.17487/RFC3986, January 2005,
              <https://www.rfc-editor.org/info/rfc3986>.

   [RFC5234]  Crocker, D., Ed. and P. Overell, "Augmented BNF for
              Syntax Specifications: ABNF", STD 68, RFC 5234,
              DOI 10.17487/RFC5234, January 2008,
              <https://www.rfc-editor.org/info/rfc5234>.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017, <https://www.rfc-editor.org/info/rfc8174>.

   [RFC8288]  Nottingham, M., "Web Linking", RFC 8288,
              DOI 10.17487/RFC8288, October 2017,
              <https://www.rfc-editor.org/info/rfc8288>.

6.2.  Informative References

   [ROBOTSTXT]
              "Robots Exclusion Protocol", n.d.,
              <http://www.robotstxt.org/>.

   [SITEMAPS] "Sitemaps Protocol", n.d.,
              <https://www.sitemaps.org/index.html>.

Authors' Addresses
   Martijn Koster (editor)
   Stalworthy Computing, Ltd.
   Suton Lane
   Wymondham, Norfolk
   NR18 9JG
   United Kingdom

   Email: m.koster@greenhills.co.uk

   Gary Illyes (editor)
   Google LLC.
   Brandschenkestrasse 110
   CH-8002 Zurich
   Switzerland

   Email: garyillyes@google.com

   Henner Zeller (editor)
   Google LLC.
   1600 Amphitheatre Pkwy
   Mountain View, CA, 94043
   United States of America

   Email: henner@google.com

   Lizzi Sassman (editor)
   Google LLC.
   Brandschenkestrasse 110
   CH-8002 Zurich
   Switzerland

   Email: lizzi@google.com