idnits 2.17.1

draft-wilde-text-fragment-00.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Looks like you're using RFC 2026 boilerplate.  This must be updated to
     follow RFC 3978/3979, as updated by RFC 4748.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an IANA Considerations section.  (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not
     match the current year

  == Line 85 has weird spacing: '... allow forma...'

  == The document seems to lack the recommended RFC 2119 boilerplate, even if
     it appears to use RFC 2119 keywords.

     (The document does seem to have the reference to RFC 2119 which the
     ID-Checklist requires).

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer.
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (July 11, 2002) is 7957 days in the past.  Is this
     intentional?

  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Possible downref: Non-RFC (?) normative reference: ref. 'ISO9945-2'

  ** Obsolete normative reference: RFC 2234 (Obsoleted by RFC 4234)

  ** Obsolete normative reference: RFC 2396 (Obsoleted by RFC 3986)

  ** Obsolete normative reference: RFC 2646 (Obsoleted by RFC 3676)

  -- Obsolete informational reference (is this intentional?): RFC 2629
     (Obsoleted by RFC 7749)

     Summary: 5 errors (**), 0 flaws (~~), 4 warnings (==), 4 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

Network Working Group                                           E. Wilde
Internet-Draft                                Swiss Federal Institute of
Expires: January 9, 2003                                      Technology
                                                           July 11, 2002

         URI Fragment Identifiers for the text/plain Media Type
                      draft-wilde-text-fragment-00

Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on January 9, 2003.

Copyright Notice

   Copyright (C) The Internet Society (2002).  All Rights Reserved.

Abstract

   This memo defines URI fragment identifiers for text/plain resources.
   These fragment identifiers make it possible to refer to parts of a
   text resource, identified by character count or range, line count or
   range, or a regular expression.  These identification methods can be
   combined to identify more than one sub-resource of a text/plain
   resource.

Table of Contents

   1.    Introduction
   1.1   What is text/plain?
   1.2   What is a URI Fragment Identifier?
   1.3   Why text/plain Fragment Identifiers?
   2.    Fragment Identification Methods
   2.1   Fragment Identification Schemes
   2.1.1 Character Count
   2.1.2 Character Range
   2.1.3 Line Count
   2.1.4 Line Range
   2.1.5 Regular Expressions
   2.1.6 Combining Fragment Identification Schemes
   3.    Fragment Identification Syntax
   4.    Examples
   5.    Security Considerations
         Normative References
         Non-Normative References
         Author's Address
   A.    POSIX BRE Syntax
   B.    Where to send Comments
   C.    Acknowledgements
         Full Copyright Statement

1. Introduction

   Compliant software MUST follow this specification.  The capitalized
   key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

1.1 What is text/plain?

   Internet Media Types as defined in RFC 2045 [RFC2045] and RFC 2046
   [RFC2046] are used to identify different types and sub-types of
   media.  RFC 2046 [RFC2046] and RFC 2646 [RFC2646] specify the
   text/plain media type, which is used for simple, unformatted text.
   Quoting from RFC 2046 [RFC2046]: "Plain text does not provide for or
   allow formatting commands, font attribute specifications, processing
   instructions, interpretation directives, or content markup.  Plain
   text is seen simply as a linear sequence of characters, possibly
   interrupted by line breaks or page breaks."

   The text/plain media type does not restrict the character encoding;
   any character encoding may be used.  In the absence of an explicit
   character encoding declaration, US-ASCII is assumed as the default
   character encoding.  This variability of the character encoding makes
   it impossible to count characters in a text/plain resource without
   taking the character encoding into account, because many character
   encodings use more than one octet per character.

   The biggest advantage of text/plain resources is their portability
   among different platforms.  As long as they use popular character
   encodings (such as US-ASCII), they can be displayed and processed on
   virtually every computer system.

1.2 What is a URI Fragment Identifier?

   URIs are the identification mechanism for resources on the Web.
   The URI syntax specified in RFC 2396 [RFC2396] includes as part of a
   URI reference a fragment identifier, which (quoting from RFC 2396
   [RFC2396]) "consists of additional reference information to be
   interpreted by the user agent after the retrieval action has been
   successfully completed.  As such, it is not part of a URI, but is
   often used in conjunction with a URI".

   The most widely used fragment identifier is the one defined for
   text/html (specified in RFC 2854 [RFC2854]), which makes it possible
   to refer to a specific element of an HTML document.

1.3 Why text/plain Fragment Identifiers?

   Referring to specific parts of a resource can be very useful, because
   it enables users to create more specific references.  Rather than
   pointing to a whole resource, users can create references to the part
   they are really interested in or want to talk about.  Even though it
   is suggested that fragment identification methods be specified in a
   media type's MIME registration, many media types do not have fragment
   identification methods associated with them.

   Fragment identifiers are only useful if supported by the client,
   because they are only interpreted by the client.  Therefore, a new
   fragment identification method will require some time to be adopted
   by clients, and older clients will not support it.  However, because
   the URI reference still works even if the fragment identifier is not
   supported (the resource is retrieved, but the fragment identifier is
   not interpreted), rapid adoption is not critical to the success of a
   new fragment identification method.

   Fragment identifiers for text/plain make it possible to refer to
   specific parts of a text resource, either by line count, by character
   count, or by using a regular expression for searching for a specific
   character sequence.
   Thus, text/plain fragment identifiers enable
   users to exchange information more specifically, thereby reducing
   the time and effort necessary to manually search for the relevant
   part of a text/plain resource.

2. Fragment Identification Methods

   The identification of resource fragments of text/plain resources can
   be based on different foundations.  Since it is not possible to
   insert explicit identifiers into a text/plain resource (as is
   possible with HTML documents by using special attributes), fragment
   identification has to rely on certain inherent criteria of the
   resource.  This memo specifies fragment identification using five
   different methods: character counts and ranges, line counts and
   ranges, and regular expression matching.

   When interpreting character or line numbers, implementations MUST
   take the character encoding of the resource into account, because
   character count and octet count may differ for the character encoding
   being used.  For example, a resource using UTF-16 encoding uses two
   or four octets per character, and it may have a leading BOM
   (Byte-Order Mark) which does not count as a character and thus also
   affects the mapping from a simple octet count to a character count.

2.1 Fragment Identification Schemes

2.1.1 Character Count

   The simplest way to identify a fragment is to point to a certain
   character of the resource.  Rather than identifying a fragment
   consisting of a number of characters, this method only identifies a
   single character, but this is often sufficient for referring to the
   start of a region of interest.  Character counting starts with 1, so
   the first character of a text/plain resource has the count 1.

2.1.2 Character Range

   If it is necessary to identify a fragment of multiple characters
   using character counting, this can be done by using a character
   range.
   A character range is a consecutive region of the resource
   that extends from the starting character of the range to the ending
   character of the range.  The ending character of the range must have
   a greater number than the starting character.

2.1.3 Line Count

   Lines in text/plain resources are separated by CRLF sequences, and
   consequently it is easy to identify lines.  Because lines are the
   only structural property of text/plain resources, it is possible to
   identify a fragment of a resource by referring to a particular line.
   Line counting starts with 1, so the first line of a text/plain
   resource has the count 1.  If a resource does not contain any CRLF
   sequences, then it consists of a single (the first) line.

2.1.4 Line Range

   If it is necessary to identify a fragment of multiple lines using
   line counting, this can be done by using a line range.  A line range
   is a consecutive region of the resource that extends from the
   starting line of the range to the ending line of the range.  The
   ending line of the range must have a greater number than the starting
   line.

2.1.5 Regular Expressions

   A common problem with fragment identifiers is their lack of
   robustness against changes in the resource; character and line counts
   in particular can be broken very easily.  A more robust way of
   identifying a fragment is by searching for a specific pattern.  Thus,
   it is possible to use a Basic Regular Expression (BRE) as defined by
   ISO 9945-2 [ISO9945-2] (the POSIX standard) as a fragment identifier
   (Appendix A contains a short summary of the POSIX BRE syntax).

2.1.6 Combining Fragment Identification Schemes

   While in most cases only one fragment identification scheme will be
   used, it is possible to combine them.
   When different fragment identification schemes are simply
   concatenated, the whole fragment identifier refers to the union of
   all parts of the text resource identified by the individual fragment
   identification schemes.  This way, it is possible to identify
   disjoint ranges, such as multiple line ranges.

   It should be noted that regular expressions by themselves may
   identify disjoint fragments, which is the case whenever the regular
   expression matches more than one occurrence in the resource.

   Since disjoint fragments can be identified, implementations SHOULD
   make sure that these fragments are appropriately marked, for example
   by highlighting the fragment (rather than only scrolling to some
   line, which only identifies a single location in the resource).
   However, the exact method of how implementations deal with disjoint
   fragments depends on the application and interface, and is beyond the
   scope of this memo.

3. Fragment Identification Syntax

   The syntax for the fragment identifiers is very straightforward.  The
   syntax defines three schemes, 'char', 'line', and 'match'.  The
   'char' and 'line' schemes can be used in two different variants,
   either the count variant (with a single number), or the range variant
   (with two comma-separated numbers).  The 'match' scheme has a regular
   expression as its parameter, which must be specified as a string with
   balanced parentheses.

   The following syntax definition uses ABNF as defined in RFC 2234
   [RFC2234].

   text-fragment = 1*text-scheme
   text-scheme   = ( char-scheme / line-scheme / match-scheme )
   char-scheme   = "char(" ( count / range ) ")"
   line-scheme   = "line(" ( count / range ) ")"
   match-scheme  = "match(" regex ")"
   count         = 1*DIGIT
   range         = count "," count
   regex         = StringWithBalancedParens

   The StringWithBalancedParens may only contain balanced parentheses.
   If unbalanced parentheses need to be used, they must be escaped with
   a '^' character.  A literal '^' must be escaped as '^^'.  Thus,
   before interpreting the StringWithBalancedParens as a BRE, it must be
   searched for '^(', '^)', and '^^', and these strings must be
   replaced with their unescaped variants.

   If any count value is greater than the value for the actual resource,
   then it identifies the last character or line of the resource.  If a
   range scheme's counts are not properly ordered (i.e., the first
   number is not less than the second), then this scheme part has to be
   ignored.

4. Examples

   The following examples show some usages for the fragment identifiers
   defined in this memo.

   http://example.com/text.txt#char(100)

   This URI reference identifies the 100th character of the text.txt
   resource.  It should be noted that it is not clear which octet(s) of
   the resource this will be without retrieving the resource and thus
   knowing which character encoding is used for it (in case of HTTP,
   this information will be given in the response's Content-Type
   header).

   http://example.com/text.txt#line(10,20)

   This URI reference identifies lines 10 to 20 of the text.txt
   resource.  If the resource has fewer than 10 lines, it identifies the
   last line.  If the resource has fewer than 20 but at least 10 lines,
   it identifies the lines 10 to the last line of the resource.
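
   The syntax of Section 3 and the clamping behaviour of the two
   examples above can be illustrated with a short Python sketch.  It is
   non-normative and not part of this memo's definitions.  The names
   parse_fragment, unescape_regex, and resolve are hypothetical.  The
   sketch assumes that the fragment has already been URL-decoded, that
   the third alternative of text-scheme is the 'match(...)' form, that
   counts are at least 1, and that the resource is not empty.

```python
import re
from urllib.parse import unquote

# count = 1*DIGIT, range = count "," count
COUNT_OR_RANGE = re.compile(r"([0-9]+)(?:,([0-9]+))?\Z")


def unescape_regex(raw):
    """Replace the escapes '^(', '^)' and '^^' with the literal character."""
    return re.sub(r"\^([()^])", r"\1", raw)


def parse_fragment(fragment):
    """Split a URL-decoded fragment into (scheme, argument) pairs.

    'char' and 'line' arguments become (first, last), with last=None for
    the count variant; 'match' arguments become the unescaped regex string.
    Returns None if the fragment does not fit the grammar.
    """
    schemes = []
    pos = 0
    while pos < len(fragment):
        for name in ("char", "line", "match"):
            if fragment.startswith(name + "(", pos):
                break
        else:
            return None  # unknown scheme name
        start = pos + len(name) + 1
        depth, i = 1, start
        # Find the ")" closing this scheme, skipping '^'-escaped characters.
        while i < len(fragment) and depth > 0:
            ch = fragment[i]
            if ch == "^" and i + 1 < len(fragment):
                i += 2
                continue
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
            i += 1
        if depth != 0:
            return None  # unbalanced parentheses
        body = fragment[start:i - 1]
        if name == "match":
            schemes.append((name, unescape_regex(body)))
        else:
            m = COUNT_OR_RANGE.match(body)
            if m is None:
                return None
            last = int(m.group(2)) if m.group(2) else None
            schemes.append((name, (int(m.group(1)), last)))
        pos = i
    return schemes or None


def resolve(first, last, total):
    """Map a count or range onto a resource with `total` characters or lines.

    Returns an inclusive 1-based (first, last) pair, or None if the scheme
    part has to be ignored. Assumes first >= 1 and total >= 1.
    """
    if last is None:
        last = first       # count variant: a single position
    elif last <= first:
        return None        # range not properly ordered: ignore this part
    # Counts beyond the end identify the last character or line.
    return (min(first, total), min(last, total))


if __name__ == "__main__":
    print(parse_fragment(unquote("match(hello%5E()")))
    print(resolve(10, 20, 15))
```

   For instance, parse_fragment("line(10,20)") yields a single 'line'
   scheme with the pair (10, 20), and resolve(10, 20, 15) clamps it to
   lines 10 to 15, which corresponds to the line(10,20) example for a
   resource of 15 lines.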

   http://example.com/text.txt#match(searchterm)

   This URI reference identifies all occurrences of the regular
   expression 'searchterm' in the resource, i.e., all occurrences of the
   string 'searchterm'.  If there is more than one occurrence, then this
   URI references a disjoint fragment, consisting of all of these
   occurrences.

   http://example.com/text.txt#line(1)match(searchterm)

   This URI reference identifies the first line and all occurrences of
   the regular expression 'searchterm' in the resource.  If there is an
   occurrence of 'searchterm' outside of the first line, then this URI
   references a disjoint fragment.

   http://example.com/text.txt#match(hello%5E()

   This URI reference identifies all occurrences of the regular
   expression 'hello(' in the resource.  The fragment identifier must
   first be URL-decoded, which leads to the scheme part 'hello^('.  This
   is then interpreted according to the definition of a string with
   balanced parentheses, treating the '^(' as an escaped '(', so that
   the actual regular expression is 'hello('.  If there is more than one
   occurrence of this regular expression, then this URI references a
   disjoint fragment, consisting of all of these occurrences.

5. Security Considerations

   There are no relevant security considerations for this memo.

Normative References

   [ISO9945-2]  International Organization for Standardization,
                "Information technology - Portable Operating System
                Interface (POSIX) - Part 2: Shell and Utilities", ISO
                9945-2, xxxxx 1993.

   [RFC2045]    Freed, N. and N. Borenstein, "Multipurpose Internet Mail
                Extensions (MIME) Part One: Format of Internet Message
                Bodies", RFC 2045, November 1996.

   [RFC2046]    Freed, N. and N. Borenstein, "Multipurpose Internet Mail
                Extensions (MIME) Part Two: Media Types", RFC 2046,
                November 1996.

   [RFC2119]    Bradner, S., "Key words for use in RFCs to Indicate
                Requirement Levels", RFC 2119, March 1997.

   [RFC2234]    Crocker, D. and P. Overell, "Augmented BNF for Syntax
                Specifications: ABNF", RFC 2234, November 1997.

   [RFC2396]    Berners-Lee, T., Fielding, R. and L. Masinter, "Uniform
                Resource Identifiers (URI): Generic Syntax", RFC 2396,
                August 1998.

   [RFC2646]    Gellens, R., "The Text/Plain Format Parameter",
                RFC 2646, August 1999.

Non-Normative References

   [RFC2629]    Rose, M., "Writing I-Ds and RFCs using XML", RFC 2629,
                June 1999.

   [RFC2854]    Connolly, D. and L. Masinter, "The 'text/html' Media
                Type", RFC 2854, June 2000.

Author's Address

   Erik Wilde
   Swiss Federal Institute of Technology
   ETH-Zentrum
   8092 Zurich
   Switzerland

   Phone: +41-1-6325132
   EMail: ietf@dret.net
   URI:   http://dret.net/netdret/

Appendix A. POSIX BRE Syntax

   This section contains a short (and non-normative) summary of the
   POSIX BRE syntax defined in ISO 9945-2 [ISO9945-2].

   (tbd - is there some rfc that could be referenced instead?)

Appendix B. Where to send Comments

   Please send all comments about this document to Erik Wilde.

Appendix C. Acknowledgements

   This document has been written using the IETF document DTD described
   in RFC 2629 [RFC2629].

Full Copyright Statement

   Copyright (C) The Internet Society (2002).  All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain it
   or assist in its implementation may be prepared, copied, published
   and distributed, in whole or in part, without restriction of any
   kind, provided that the above copyright notice and this paragraph are
   included on all such copies and derivative works.
   However, this document itself may not be modified in any way, such as
   by removing the copyright notice or references to the Internet Society
   or other Internet organizations, except as needed for the purpose of
   developing Internet standards in which case the procedures for
   copyrights defined in the Internet Standards process must be
   followed, or as required to translate it into languages other than
   English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assigns.

   This document and the information contained herein is provided on an
   "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
   TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
   BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
   HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Acknowledgement

   Funding for the RFC Editor function is currently provided by the
   Internet Society.