Network Working Group                                           E. Wilde
Internet-Draft                                               UC Berkeley
Updates: 2046 (if approved)                                    M. Duerst
Intended status: Standards Track                Aoyama Gakuin University
Expires: July 21, 2007                                      Jan 17, 2007


         URI Fragment Identifiers for the text/plain Media Type
                      draft-wilde-text-fragment-06

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on July 21, 2007.

Copyright Notice

   Copyright (C) The Internet Society (2007).

Abstract

   This memo defines URI fragment identifiers for text/plain MIME
   entities. These fragment identifiers make it possible to refer to
   parts of a text/plain MIME entity, identified by character count or
   range, line count or range, or a regular expression. These
   identification methods can be combined to identify more than one sub-
   resource of a text/plain MIME entity. Fragment identifiers may also
   contain hash information to make them more robust.

Table of Contents

   1.  Introduction
     1.1.  What is text/plain?
     1.2.  What is a URI Fragment Identifier?
     1.3.  Why text/plain Fragment Identifiers?
     1.4.  Incremental Deployment
     1.5.  Notation Used in this Memo
   2.  Fragment Identification Methods
     2.1.  Fragment Identification Principles
       2.1.1.  Positions and Ranges
       2.1.2.  Characters and Lines
     2.2.  Combining the Principles
       2.2.1.  Character Position
       2.2.2.  Character Range
       2.2.3.  Line Position
       2.2.4.  Line Range
     2.3.  Regular Expressions
     2.4.  Combining Fragment Identification Scheme Parts
     2.5.  Fragment Identifier Robustness
   3.  Fragment Identification Syntax
     3.1.  Non-ASCII Characters in Regular Expressions
     3.2.  Hash Sums
   4.  Fragment Identifier Processing
     4.1.  Handling of Line Endings in text/plain MIME Entities
     4.2.  Handling of Position Values
     4.3.  Handling of Hash Sums
     4.4.  Syntax Errors in Fragment Identifiers
   5.  Examples
   6.  IANA Considerations
   7.  Security Considerations
   8.  Change Log
     8.1.  From -05 to -06
     8.2.  From -04 to -05
     8.3.  From -03 to -04
     8.4.  From -02 to -03
     8.5.  From -01 to -02
     8.6.  From -00 to -01
   9.  References
     9.1.  Normative References
     9.2.  Non-Normative References
   Appendix A.  Acknowledgements
   Authors' Addresses
   Intellectual Property and Copyright Statements

1.  Introduction

   This memo updates the text/plain MIME type defined in RFC 2046 [1] by
   defining URI fragment identifiers for text/plain MIME entities. This
   makes it possible to refer to parts of a text/plain MIME entity.
   Such parts can be identified by character count or range, line count
   or range, or a regular expression. Hash information can be added to
   a fragment identifier to make it more robust.

   This section gives an introduction to the general concepts of text/
   plain MIME entities and URI fragment identifiers, and discusses the
   need for fragment identifiers for text/plain and deployment issues.
   Section 2 discusses the principles and methods on which this memo is
   based. Section 3 gives the syntax, and Section 4 discusses
   processing of text/plain fragment identifiers. Section 5 shows some
   examples.

1.1.  What is text/plain?

   Internet Media Types as defined in RFC 2045 [2] and RFC 2046 [1] are
   used to identify different types and sub-types of media. RFC 2046
   [1] and RFC 3676 [3] specify the text/plain media type, which is used
   for simple, unformatted text.
   Quoting from RFC 2046 [1]: "Plain text
   does not provide for or allow formatting commands, font attribute
   specifications, processing instructions, interpretation directives,
   or content markup. Plain text is seen simply as a linear sequence of
   characters, possibly interrupted by line breaks or page breaks."

   The text/plain media type does not restrict the character encoding;
   any character encoding may be used. In the absence of an explicit
   character encoding declaration, US-ASCII is assumed as the default
   character encoding. This variability of the character encoding makes
   it impossible to count characters in a text/plain MIME entity without
   taking the character encoding into account, because there are many
   character encodings using more than one octet per character.

   The biggest advantage of text/plain MIME entities is their ease of
   use and their portability among different platforms. As long as they
   use popular character encodings (such as US-ASCII or UTF-8), they can
   be displayed and processed on virtually every computer system. The
   only remaining interoperability issue is the representation of line
   endings, which is discussed in Section 4.1.

1.2.  What is a URI Fragment Identifier?

   URIs are the identification mechanism for resources on the Web. The
   URI syntax specified in RFC 3986 [4] includes as part of a URI a
   fragment identifier, separated by a number sign ('#'). The fragment
   identifier consists of additional reference information to be
   interpreted by the user agent after the retrieval action has been
   successfully completed. The semantics of a fragment identifier are a
   property of the data resulting from a retrieval action, regardless of
   the type of URI used in the reference. Therefore, the format and
   interpretation of fragment identifiers depend on the media type
   of the retrieval result.

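   As a small illustration of the split described above, the following
   Python sketch separates a URI reference from its fragment identifier
   using the standard library function urllib.parse.urldefrag. The URI
   is an illustrative example.com address, and nothing in the sketch is
   specific to text/plain.

```python
from urllib.parse import urldefrag

# The client splits the fragment off and interprets it itself, only
# after the retrieval has completed.
uri = "http://example.com/text.txt#line=10,20"
retrieval_uri, fragment = urldefrag(uri)

print(retrieval_uri)  # http://example.com/text.txt
print(fragment)       # line=10,20
```

   How the fragment string is then interpreted depends entirely on the
   media type of the entity that was retrieved.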
   The most popular fragment identifier is defined for text/html
   (defined in RFC 2854 [11]), and makes it possible to refer to a
   specific element (identified by the value of a 'name' or 'id'
   attribute) of an HTML document.

1.3.  Why text/plain Fragment Identifiers?

   Referring to specific parts of a resource can be very useful, because
   it enables users and applications to create more specific references.
   Rather than pointing to a whole resource, users can create references
   to the part they really are interested in or want to talk about.
   Even though it is suggested that fragment identification methods are
   specified in a media type's MIME registration (see [12]), many media
   types do not have fragment identification methods associated with
   them.

   Fragment identifiers are only useful if supported by the client,
   because they are only interpreted by the client. Therefore, a new
   fragment identification method will require some time to be adopted
   by clients, and older clients will not support it. However, because
   the URI still works even if the fragment identifier is not supported
   (the resource is retrieved, but the fragment identifier is not
   interpreted), rapid adoption is not highly critical to ensure the
   success of a new fragment identification method.

   Fragment identifiers for text/plain as defined in this memo make it
   possible to refer to specific parts of a text/plain MIME entity,
   using concepts of positions and ranges, which may be applied to
   characters and lines. They also support locating a fragment by using
   a regular expression for searching for a specific character sequence.
   Thus, text/plain fragment identifiers enable users to exchange
   information more specifically, thereby reducing time and effort that
   is necessary to manually search for the relevant part of a text/plain
   MIME entity.

   The text/plain format does not support the embedding of links, so in
   normal environments, text/plain resources can only serve as targets
   for links, and not as sources. However, when combining the text/
   plain fragment identifiers specified in this memo with out-of-line
   linking mechanisms such as XLink [13], it is possible to "embed" link
   sources into text/plain resources. Thus, the text/plain fragment
   identifiers specified in this memo open a path for text/plain files
   to become fully integrated resources in hypermedia systems such as
   the Web.

1.4.  Incremental Deployment

   As long as support for text/plain fragment identifiers is not
   implemented everywhere, it is important to consider the implications
   of incremental deployment. Clients (for example, Web browsers) not
   supporting the text/plain fragment identifier described in this memo
   will work with URI references to text/plain MIME entities, but they
   will fail to locate the sub-resource identified by the fragment
   identifier. This is a reasonable fallback behavior, and in general
   users should take into account the possibility that a program
   interpreting a given URI will fail to interpret the fragment
   identifier part. Since fragment identifier evaluation is local to
   the client (and happens after retrieving the MIME entity), there is
   no way for a server to determine whether a requesting client is using
   a URI containing a fragment identifier.

1.5.  Notation Used in this Memo

   The capitalized key words "MUST", "MUST NOT", "REQUIRED", "SHALL",
   "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in RFC
   2119 [5].

2.  Fragment Identification Methods

   The identification of fragments of text/plain MIME entities can be
   based on different foundations.
   Since it is not possible to insert
   explicit, invisible identifiers into a text/plain MIME entity (as is
   done, for example, in HTML documents through dedicated
   attributes), fragment identification has to rely on certain inherent
   properties of the MIME entity. This memo specifies fragment
   identification using six different methods: character positions,
   character ranges, line positions, line ranges, regular expression
   matching, and a mechanism for improving the robustness of fragment
   identifiers (entity hashes).

   When interpreting character or line numbers, implementations MUST
   take the character encoding of the MIME entity into account, because
   character count and octet count may differ for the character encoding
   being used. For example, a MIME entity using UTF-16 encoding (as
   specified in RFC 2781 [14]) uses two octets per character in most
   cases, and sometimes four octets per character. It can also have a
   leading BOM (Byte-Order Mark), which does not count as a character
   and thus also affects the mapping from a simple octet count to a
   character count.

2.1.  Fragment Identification Principles

   Fragment identification can be done using regular expressions or
   by combining two orthogonal principles: positions and ranges,
   and characters and lines. This section describes the principles
   themselves, while Section 2.2 describes the combination of the
   principles.

2.1.1.  Positions and Ranges

   A position does not identify an actual fragment of the MIME entity,
   but a position inside the MIME entity, which can be regarded as a
   fragment of zero length. The use case for positions is to provide
   pointers for applications which may use them to implement
   functionalities such as "insert some text here", which needs a
   position rather than a fragment.
   Positions are counted from zero,
   position zero being before the first character or line of a text/
   plain MIME entity. Thus a text/plain MIME entity having one
   character has two positions, one before the first character (position
   0), and one after the first character (position 1).

   Since positions are fragments of length zero, applications SHOULD use
   methods other than highlighting to indicate positions, the most
   obvious way being the positioning of a cursor (if the application
   supports the concept of a cursor).

   Ranges, on the other hand, identify fragments of a MIME entity that
   have a length that may be greater than zero. As a general principle,
   ranges specify both a lower and an upper bound. The start
   or the end of a range specification may be omitted, defaulting to the
   first or last position of the MIME entity, respectively. The end of a
   range must have a value greater than or equal to the start. A range
   with identical start and end is legal, and identifies a range of
   length 0, which is equivalent to a position.

   Applications that support a concept such as highlighting SHOULD use
   such a concept to indicate fragments of length greater than zero to
   the user.

   For positions and ranges it is implicitly assumed that if a number is
   greater than the actual number of elements in the MIME entity, then
   it is referring to the last element of the MIME entity (see Section 4
   for details).

2.1.2.  Characters and Lines

   The concept of positions and ranges can be applied to characters or
   lines. In both cases, positions indicate points between entities,
   while ranges identify zero or more entities by indicating positions.

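   Because positions sit between items and count from zero, they map
   directly onto slice bounds. The following Python sketch illustrates
   the defaulting, clamping, and ordering rules described above. The
   function name resolve_range is hypothetical and not part of this
   memo, and the sketch assumes the entity has already been decoded
   into characters or split into lines.

```python
def resolve_range(n_items, start=None, end=None):
    """Map a (start, end) position pair onto a Python slice.

    Position 0 is before the first item and position n_items is after
    the last one.  Returns None for a misordered range.
    """
    # The end of a range must be greater than or equal to the start.
    if start is not None and end is not None and start > end:
        return None
    # A missing bound defaults to the first or last position; a value
    # beyond the entity is clamped to the last position.
    lo = 0 if start is None else min(start, n_items)
    hi = n_items if end is None else min(end, n_items)
    return slice(lo, hi)


text = "hello world"
assert text[resolve_range(len(text), 0, 5)] == "hello"
assert text[resolve_range(len(text), 6, None)] == "world"   # end omitted
assert text[resolve_range(len(text), None, 5)] == "hello"   # start omitted
assert text[resolve_range(len(text), 3, 3)] == ""           # a position
assert text[resolve_range(len(text), 6, 999)] == "world"    # clamped
assert resolve_range(len(text), 5, 2) is None               # misordered

# The same function applies to lines; each line keeps its line ending.
lines = ["first\n", "second\n", "third\n"]
assert "".join(lines[resolve_range(len(lines), 1, 2)]) == "second\n"
```

   A single-character entity has two positions under this mapping (0
   and 1), matching the description in Section 2.1.1.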
   Character positions are numbered starting with zero (ignoring an
   initial BOM or similar concepts that are not part of the actual
   textual content of a text/plain MIME entity), and counting each
   character separately, with the exception of line endings, which are
   always counted as one character (see Section 4.1 for details).

   Line positions are numbered starting with zero (with line position
   zero always being identical with character position zero), with
   Section 4.1 describing how line endings are identified. Fragments
   identified by lines include the line endings, so applications
   identifying line-based fragments MUST include the line endings in the
   fragment identification they are using (e.g., the highlighted
   selection). If a MIME entity does not contain any line endings, then
   it consists of a single (the first) line.

2.2.  Combining the Principles

   In the following sections, the principles described in the preceding
   section (positions/ranges and characters/lines) are combined,
   resulting in four use cases. The fragment identifier syntax,
   described in detail in Section 3, uses various schemes for different
   purposes.

2.2.1.  Character Position

   To identify a character position (i.e., a fragment of length zero
   between two characters), the 'char' scheme followed by a single
   number is used. Rather than identifying a fragment consisting of a
   number of characters, this method identifies a position between two
   characters (or before the first or after the last character).
   Character position counting starts with 0, so the character position
   before the first character of a text/plain MIME entity has the
   character position 0, and a MIME entity containing n distinct
   characters has n+1 distinct character positions, the last one having
   the character position n.

2.2.2.  Character Range

   To identify a fragment of one or more characters (a character range),
   the 'char' scheme followed by a range specification is used. A
   character range is a consecutive region of the MIME entity that
   extends from the starting character position of the range to the
   ending character position of the range.

2.2.3.  Line Position

   To identify a line position (i.e., a fragment of length zero between
   two lines), the 'line' scheme followed by a single number is used.
   Rather than identifying a fragment consisting of a number of lines,
   this method identifies a position between two lines (or before the
   first or after the last line). Line position counting starts with 0,
   so the line position before the first line of a text/plain MIME
   entity has the line position 0, and a MIME entity containing n
   distinct lines has n+1 distinct line positions, the last one having
   the line position n.

2.2.4.  Line Range

   To identify a fragment of one or more lines (a line range), the
   'line' scheme followed by a range specification is used. A line
   range is a consecutive region of the MIME entity that extends from
   the starting line position of the range to the ending line position
   of the range.

2.3.  Regular Expressions

   A common problem with fragment identifiers is their robustness (to
   changes in the MIME entity), and character and line counts can break
   very easily. A more robust way of identifying a fragment is by
   searching for a specific pattern. Using the 'match' scheme, it is
   possible to use a Basic Regular Expression (BRE) as defined by ISO
   9945-2 [6] (the POSIX standard) as a fragment identifier. For
   another way of making fragment identifiers more robust, see
   Section 2.5.

2.4.  Combining Fragment Identification Scheme Parts

   In most cases, a fragment identifier will consist of only one
   fragment identification scheme part.
   However, by concatenating them,
   separated with a semicolon, it is possible to use several fragment
   identification scheme parts in a fragment identifier. The whole
   fragment identifier refers to the union of all fragments of the text/
   plain MIME entity identified by the individual fragment
   identification scheme parts. In this way, it is possible to identify
   disjoint ranges, such as multiple line ranges.

   It should be noted that regular expressions by themselves may
   identify disjoint fragments, which is true in any case where the
   regular expression matches more than one occurrence in the MIME
   entity.

   Since disjoint fragments can be identified, implementations SHOULD
   make sure that these fragments are appropriately marked, for example
   by highlighting the fragment (rather than only scrolling to some
   line, which only identifies a single position in the MIME entity).
   If an implementation cannot mark disjoint fragments, it MAY resort
   to marking only the first of the disjoint fragments. However, the
   exact method of how implementations deal with disjoint fragments
   depends on the application and interface, and is beyond the scope of
   this memo.

2.5.  Fragment Identifier Robustness

   While regular expressions (as described in Section 2.3) may make
   fragment identifiers more robust than character or line counts, it is
   still possible that modifications of the resource will break the
   fragment identifier. If applications want to create more robust
   fragment identifiers, they may do so by adding hash sums to fragment
   identifiers. These hash sums are used to detect a change in the
   resource. Applications can then warn users about the possibility
   that a fragment identifier might have been broken by a modification
   of the resource.

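   The change detection described above can be sketched in Python as
   follows, using the two kinds of hash sum this memo defines (entity
   length in characters and MD5 fingerprint). The function names
   fingerprint and looks_unchanged are hypothetical. The sketch makes
   two simplifying assumptions: the entity has no BOM, and it uses
   single-character line endings, so that len() on the decoded text
   yields the character count directly.

```python
import hashlib
from typing import Optional, Tuple


def fingerprint(entity: bytes, charset: str) -> Tuple[int, str]:
    """Return (character count, MD5 hex digest) for a retrieved entity.

    The length counts decoded characters (code points); the MD5 digest
    is computed over the octets as received.
    """
    text = entity.decode(charset)
    return len(text), hashlib.md5(entity).hexdigest()


def looks_unchanged(entity: bytes, charset: str,
                    length: Optional[int] = None,
                    md5: Optional[str] = None) -> bool:
    """Compare recorded hash sums against the entity actually retrieved."""
    actual_length, actual_md5 = fingerprint(entity, charset)
    if length is not None and length != actual_length:
        return False
    if md5 is not None and md5.lower() != actual_md5:
        return False
    return True


original = "na\u00efve text\n".encode("utf-8")
recorded_length, recorded_md5 = fingerprint(original, "utf-8")
print(recorded_length, len(original))  # 11 characters, 12 octets

modified = original + b"one more line\n"
print(looks_unchanged(original, "utf-8", recorded_length, recorded_md5))  # True
print(looks_unchanged(modified, "utf-8", recorded_length, recorded_md5))  # False
```

   A client that finds a mismatch can then warn the user instead of
   silently pointing at the wrong part of the entity.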
   Since fragment identifiers are interpreted by clients, hash sums are
   defined on MIME entities rather than the resource itself, and as such
   are specific to a certain representation of the resource, which in
   the case of text/plain resources means the character encoding of the
   MIME entity.

   Hash sums may specify the character encoding that has been used when
   creating the hash sums, and if such a specification is present,
   clients MUST check whether the character encoding specified for the
   hash sum and the character encoding of the retrieved MIME entity are
   equal, and clients MUST NOT check the hash sum if these values
   differ. However, clients MAY choose to transcode the retrieved MIME
   entity in the case of differing character encodings, and after doing
   so, check the hash sum. Please note that this method is inherently
   unreliable, because certain characters or character sequences may
   have been lost or normalized due to restrictions in one of the
   character encodings used.

3.  Fragment Identification Syntax

   The syntax for the fragment identifiers is straightforward. The
   syntax defines four schemes, 'char', 'line', 'match', and hash (which
   can either be 'length' or 'md5'). The 'char' and 'line' schemes can
   be used in two different variants, either the position variant (with
   a single number), or the range variant (with two comma-separated
   numbers). The 'match' scheme has a regular expression as its
   parameter, which must be specified as a string with escaped
   semicolons (because the semicolon is used to concatenate multiple
   fragment identification scheme parts). The hash scheme can either
   use the 'length' or the 'md5' scheme to specify a hash value.

   The following syntax definition uses ABNF as defined in RFC 4234 [7],
   including the rules DIGIT and HEXDIG.

   text-fragment =  text-scheme 0*( ";" text-scheme) 0*( ";" hash-scheme)
   text-scheme   =  ( char-scheme / line-scheme / match-scheme )
   hash-scheme   =  ( length-scheme / md5-scheme ) [ "," charenc ]
   char-scheme   =  %x63.68.61.72 "=" ( position / range )    ; "char="
   line-scheme   =  %x6C.69.6E.65 "=" ( position / range )    ; "line="
   match-scheme  =  %x6D.61.74.63.68 "=" regex                ; "match="
   position      =  number
   range         =  (position "," [ position ]) / ("," position )
   number        =  1*( DIGIT )
   regex         =  StringWithEscapedSemicolon
   length-scheme =  %x6C.65.6E.67.74.68 "=" number            ; "length="
   md5-scheme    =  %x6D.64 "5=" md5-value                    ; "md5="
   md5-value     =  32HEXDIG
   charenc       =  StringWithEscapedSemicolon

   The StringWithEscapedSemicolon is a string where all characters may
   appear literally (except the characters which are required by the URI
   syntax to be escaped), with the exception of a semicolon. A
   semicolon that is part of the regular expression must be escaped with
   a leading backslash, and implementations MUST properly interpret
   regular expressions, dereferencing all escape mechanisms that apply
   (i.e., any escaping present due to the context of the URI, semicolon
   escaping, URI percent-encoding, and BRE escaping, in that order).

3.1.  Non-ASCII Characters in Regular Expressions

   RFC 3986 [4] only allows a subset of ASCII as characters in URIs.
   Non-ASCII octets can be included using percent-encoding. Non-ASCII
   characters in regular expressions MUST be encoded using UTF-8 [8]
   before applying percent-encoding, and MUST be interpreted using UTF-8
   after resolving percent-encoding. Therefore, using Internationalized
   Resource Identifiers (IRIs) [9] it is possible to use non-ASCII
   characters directly in regular expressions.
   Implementations that
   support plain text fragment identifiers for documents not encoded in
   US-ASCII SHOULD support regular expressions with non-ASCII
   characters, or MUST ignore such regular expressions.

3.2.  Hash Sums

   A hash sum can either specify a MIME entity's length, or its MD5
   fingerprint. In both cases, it can optionally specify the character
   encoding that was used when calculating the hash sum, so that
   clients interpreting the fragment identifier may check whether they
   are using the same character encoding for their calculations. For
   lengths, the character encoding can be necessary because it can
   influence the character count. As an example, Unicode includes
   precomposed characters for writing Vietnamese, but in the windows-
   1258 encoding, also used for writing Vietnamese, some characters have
   to be encoded with separate diacritics, which means that two
   characters are counted. Applying Unicode terminology, this means
   that the length of a text/plain MIME entity is computed based on its
   "code points". For MD5 fingerprints, the character encoding is
   necessary because the MD5 algorithm works on the binary
   representation of the text/plain resource.

   The length of a text/plain MIME entity is calculated by using the
   principles defined in Section 2.1.2. The MD5 fingerprint of a text/
   plain MIME entity is calculated by using the algorithm presented in
   [10], encoding the result in 32 hexadecimal digits (using uppercase
   or lowercase letters) as a representation of the 128 bits which are
   the result of the MD5 algorithm.

4.  Fragment Identifier Processing

4.1.  Handling of Line Endings in text/plain MIME Entities

   In Internet messages, line endings in text/plain MIME entities are
   represented by CR+LF character sequences (see RFC 2046 [1] and RFC
   3676 [3]). However, some protocols (such as HTTP) in addition allow
   other conventions for line breaks.
   Also, some operating systems
   store text/plain entities locally with different line endings (in
   most cases, Unix uses LF, MacOS uses CR, and Windows uses CR+LF).

   Independent of the number of bytes or characters used to represent a
   line ending, each line ending MUST be counted as one single
   character. For the purpose of regular expression matching, all
   representations of line endings MUST be treated as single LF
   characters (matched by \n). Implementations interpreting text/plain
   fragment identifiers MUST take into account the line ending
   conventions of the protocols and other contexts that they work in.

   As an example, an implementation working in the context of a Web
   browser supporting http: URIs has to support the various line ending
   conventions permitted by HTTP. As another example, an implementation
   used on local files (e.g., with the file: URI scheme) has to support
   the conventions used for local storage. All implementations SHOULD
   support the Internet-wide CR+LF line ending convention, and MAY
   support additional conventions not related to the protocols or
   systems they work with.

   Implementers should be aware of the fact that line endings in plain
   text entities can be represented by characters or character
   sequences other than CR+LF. Besides the above-mentioned CR and LF,
   there are also NEL and CR+NEL. In general, the encoding of line
   endings can also depend on the character encoding of the MIME entity,
   and implementations have to take this into account where necessary.

4.2.  Handling of Position Values

   If any position value (as a position or as part of a range) is
   greater than the length of the actual MIME entity, then it identifies
   the last character or line position of the MIME entity. If the first
   position value in a range is not present, then the range extends from
   the start of the MIME entity.
   If the second position value in a
   range is not present, then the range extends to the end of the MIME
   entity.  If a range scheme's positions are not properly ordered
   (i.e., the first number is greater than the second), then this scheme
   part MUST be ignored.

4.3.  Handling of Hash Sums

   Clients are not required to implement the handling of hash sums, so
   they MAY choose to ignore hash sum information altogether.  However,
   if they do implement hash sum handling, the following applies:

   If a fragment identifier contains a hash sum, and a client retrieves
   a MIME entity and detects that the hash sum has changed (observing
   the character encoding specification as described in Section 3.2, if
   present), then the client SHOULD NOT interpret any other text/plain
   fragment identifier scheme part.  A client MAY signal this situation
   to the user.

4.4.  Syntax Errors in Fragment Identifiers

   If a fragment identifier contains a syntax error (i.e., does not
   conform to the syntax specified in Section 3), then it MUST be
   ignored by clients.  Clients SHOULD NOT make any attempt to correct
   or guess fragment identifiers.  Syntax errors MAY be reported by
   clients.

5.  Examples

   The following examples show some uses of the fragment identifiers
   defined in this memo.

   http://example.com/text.txt#char=100

   This URI identifies the position after the 100th character of the
   text.txt MIME entity.  It should be noted that it is not clear which
   octet(s) of the MIME entity this will be without retrieving the MIME
   entity and thus knowing which character encoding it is using (in case
   of HTTP, this information will be given in the Content-Type header of
   the response).  If the MIME entity has fewer than 100 characters, the
   URI identifies the position after the MIME entity's last character.

   ftp://example.com/text.txt#line=10,20

   This URI identifies lines 11 to 20 of the text.txt MIME entity.
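
   As a non-normative illustration, the position handling of Section 4.2
   and the line ending rule of Section 4.1 can be sketched in Python for
   the line scheme.  The function names (split_lines, resolve_range,
   select_lines), the set of recognized line endings, and the treatment
   of a trailing line ending are assumptions of this sketch and are not
   defined by this memo.

```python
import re

# Assumption: these are the line ending forms to recognize (CR+LF, CR+NEL,
# CR, LF, NEL). Longer sequences come first so that CR+LF is one line ending.
LINE_ENDING = re.compile(r"\r\n|\r\u0085|\r|\n|\u0085")


def split_lines(text):
    """Split text into lines; every line ending form counts as one break."""
    lines = LINE_ENDING.split(text)
    # Assumption: a final line ending terminates the last line and does not
    # open an additional empty line.
    if lines and lines[-1] == "":
        lines.pop()
    return lines


def resolve_range(first, second, length):
    """Turn optional range positions into (start, end), or None if ignored."""
    if first is not None and second is not None and first > second:
        return None  # positions not properly ordered: ignore the scheme part
    # A missing first position means the start of the entity; a position
    # beyond the entity is clamped to its last position.
    start = 0 if first is None else min(first, length)
    # A missing second position means the end of the entity.
    end = length if second is None else min(second, length)
    return start, end


def select_lines(text, first=None, second=None):
    """Return the lines identified by line=first,second, or None if ignored."""
    lines = split_lines(text)
    resolved = resolve_range(first, second, len(lines))
    if resolved is None:
        return None
    start, end = resolved
    return lines[start:end]


if __name__ == "__main__":
    doc = "\r\n".join("line %d" % n for n in range(1, 26))
    print(select_lines(doc, 10, 20))  # the lines selected by line=10,20
```

   In this sketch, an empty list stands for a position without any
   selected line, and None stands for a scheme part that is ignored.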
   If the MIME entity has fewer than 11 lines, the URI identifies the
   position after the last line.  If the MIME entity has fewer than 20
   but at least 11 lines, the URI identifies lines 11 through the last
   line of the MIME entity.

   http://example.com/text.txt#match=searchterm

   This URI identifies all occurrences of the regular expression
   'searchterm' in the MIME entity, i.e., all occurrences of the string
   'searchterm'.  If there is more than one occurrence, then this URI
   identifies a disjoint fragment, consisting of all of these
   occurrences.  If there is no occurrence of the search term, the URI
   does not identify a fragment.

   ftp://example.com/text.txt#line=,1;match=searchterm

   This URI identifies the first line and all occurrences of the regular
   expression 'searchterm' in the MIME entity.  If there is an
   occurrence of 'searchterm' outside of the first line, then this URI
   identifies a disjoint fragment.

   http://example.com/text.txt#match=hello\;

   This URI identifies all occurrences of the regular expression
   'hello;' in the MIME entity.  The semicolon with the leading
   backslash has to be interpreted as a literal semicolon inside of the
   BRE, treating the '\;' as an escaped ';', so that the actual regular
   expression is 'hello;'.  If there is more than one occurrence of this
   regular expression, then this URI identifies a disjoint fragment,
   consisting of all of these occurrences.

   ftp://example.com/text.txt#line=10,20;length=9876,UTF-8

   As in the second example, this URI identifies lines 11 to 20 of the
   text.txt MIME entity.  The additional length hash sum specifies that
   the MIME entity has a length of 9876 characters when encoded in
   UTF-8.  If the client supports the length hash sum scheme, it may
   test the retrieved MIME entity for its length, but only if the
   retrieved MIME entity uses the UTF-8 encoding or has been locally
   transcoded into this encoding.
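
   A client-side check of this kind might be sketched in Python as shown
   here.  This is a non-normative illustration: the function names
   (check_length, check_md5), the None result for "not checkable", and
   the restriction to CR+LF, CR, and LF line endings are assumptions of
   the sketch, and local transcoding is left out.  The MD5 comparison is
   assumed to run over the octets exactly as retrieved.

```python
import codecs
import hashlib
import re

# Assumption: only CR+LF, CR and LF are treated as line endings here.
LINE_ENDING = re.compile(r"\r\n|\r|\n")


def check_length(octets, entity_charset, expected_length, hash_charset=None):
    """Return True or False for a match or mismatch, None if not checkable."""
    if hash_charset is not None:
        # Compare canonical codec names so that "UTF-8" and "utf8" are equal.
        wanted = codecs.lookup(hash_charset).name
        actual = codecs.lookup(entity_charset).name
        if wanted != actual:
            return None  # transcoding is not part of this sketch
    text = octets.decode(entity_charset)
    # Each line ending counts as one single character (Section 4.1).
    text = LINE_ENDING.sub("\n", text)
    # len() of a Python str counts code points.
    return len(text) == expected_length


def check_md5(octets, expected_hex):
    """Compare the MD5 of the octets with a hexadecimal fingerprint."""
    # The fingerprint may use uppercase or lowercase letters.
    return hashlib.md5(octets).hexdigest() == expected_hex.lower()


if __name__ == "__main__":
    body = "first line\r\nsecond line\r\n".encode("utf-8")
    print(check_length(body, "utf-8", 9876, "UTF-8"))
```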
   If the length of the retrieved MIME
   entity does not match the length specified in the fragment
   identifier, the client SHOULD NOT interpret the line part and MAY
   signal this to the user.

6.  IANA Considerations

   Note to RFC Editor: Please change this section to read as follows
   after the IANA action has been completed: "IANA has added a reference
   to this specification in the Text/Plain Media Type registration."

   IANA is requested to update the registration of the MIME media type
   text/plain at http://www.iana.org/assignments/media-types/text/ with
   the fragment identifier defined in this memo by adding a reference to
   this memo (with the appropriate RFC number once it is known).

7.  Security Considerations

   Regular expression matching code is notoriously vulnerable to buffer
   overflow security holes, so any implementation supporting text/plain
   fragment identifiers SHOULD make sure that the code being used has
   been tested against buffer overflow attacks.

   The fact that software implementing fragment identifiers for plain
   text and software not implementing them differ in behavior, and the
   fact that different software may show fragments to users in different
   ways (in particular for fragments consisting of multiple ranges), can
   lead to misunderstandings on the part of users.  Such
   misunderstandings might be exploited in a way similar to spoofing or
   phishing, although concrete examples of how this might be done are
   not currently known.

   Implementers and users of fragment identifiers for plain text should
   also be aware of the security considerations in RFC 3986 [4] and RFC
   3987 [9].

8.  Change Log

   Note to RFC Editor: Please remove this section before publication.

8.1.  From -05 to -06

   o  Clarified that this is intended as an update of the text/plain
      MIME type registration, in the newly added IANA Considerations
      section and elsewhere.

   o  Added normative reference to UTF-8 (STD 63/RFC 3629).

   o  Fixed section about non-ASCII characters in regular expressions to
      be more accurate re. IRIs.

   o  Fixed some text about decomposition and Unicode.

   o  Clarified that UTF-16 can also use 4 octets per character.

   o  Changed ABNF to make sure schemes are case-sensitive (string
      literals in ABNF are case-insensitive).

   o  Used HEXDIG from RFC 4234, made clear DIGIT and HEXDIG are from
      that spec.

   o  Specified order of decoding the various escapings.

   o  Moved section on line endings to the back, and changed
      requirements to be more in line with practice.

   o  Added IANA Considerations section.

   o  Expanded Security Considerations section.

   o  Removed quote from RFC 3986, because the quoted text doesn't
      actually exist there anymore; changed text appropriately.

   o  Reorganized section two to get rid of one section level.

   o  Added overview in introduction, and some glue text here and there.

   o  Changed to more IETF-like wording in some instances (e.g., intro to
      this section; removing "Compliant software MUST follow this
      specification." at the start of the Introduction,...).

   o  Removed 'where to send comments' section.

   o  Fixed wording in some cases, tried to make shorter sentences and
      eliminate parenthesized expressions.

   o  Removed acknowledgement for xml2rfc; we are nevertheless very
      grateful for this work!

8.2.  From -04 to -05

   o  Added some explanatory text to the last paragraph of Section 2.5.

   o  Added a paragraph about the importance of having fragment
      identification capabilities for out-of-line linking methods such
      as XLink to Section 1.3.

   o  Added explanation of why the charset is important for length hash
      sums to Section 3.2.

   o  Added text that makes hash sum handling optional and allows
      clients to interpret fragment identifiers even if the hash sum did
      not match (changed MUST NOT to SHOULD NOT) to Section 4.3.

   o  Added example using a length hash sum in Section 5.

   o  RFC 2234 (ABNF) has been obsoleted by [7].

   o  Removed the "Open Issues" section for preparation of final draft
      before submission as RFC.

8.3.  From -03 to -04

   o  URIs are now defined by RFC 3986 [4], so the text and the
      references have been updated.  In particular, RFC 3986 defines a
      fragment identifier to be part of the URI, whereas in the
      obsoleted RFC 2396 URI specification, it was not part of a URI as
      such, but of a "URI reference".

   o  IRIs are now defined by RFC 3987 [9], so the text and the
      references have been updated.

   o  Changed IPR clause from RFC 3667 to RFC 3978 (updated version of
      RFC 3667).

8.4.  From -02 to -03

   o  Replaced most occurrences of 'resource' with 'MIME entity',
      because the result of dereferencing a URI is not the resource
      itself, but some MIME entity (in our case of type text/plain)
      representing it.  Thanks to Sandro Hawke for pointing this out.

   o  Moved "Open Issues" to the very back of the document.

   o  Added Section 4 to define the processing model for fragment
      identifiers (moved Section 4.2 from Section 3 to Section 4).

   o  Added hash scheme to make fragment identifiers more robust
      (Section 2.5).

   o  Changed IPR clause from RFC 2026 to RFC 3667 (updated version of
      RFC 2026).

8.5.  From -01 to -02

   o  Fundamental change in semantics: counts turn into positions
      (between characters or lines), so in order to identify a character
      or line, ranges must be used (which now use positions to specify
      the upper and lower bounds of the range).

   o  Made the first value of a range optional as well, so that line=,5
      also is legal, identifying everything from the start of the MIME
      entity to the 5th line.

   o  Changed the syntax from parenthesis-style to a more traditional
      style using equals-signs.

8.6.  From -00 to -01

   o  Made the second count value of ranges optional, so that something
      like line(10,) is legal and properly defined.

   o  Added non-normative reference to Internet draft about non-ASCII
      characters in search strings.

   o  Added Section 1.4 about incremental deployment.

   o  Added more elaborate examples.

   o  Added text about regex buffer overflow problems in Section 7.

   o  Added Section 4.1 about line endings in text/plain resources.

   o  Added "Open Issues" to collect open issues regarding this memo
      (will be deleted in final RFC text).

9.  References

9.1.  Normative References

   [1]   Freed, N. and N. Borenstein, "Multipurpose Internet Mail
         Extensions (MIME) Part Two: Media Types", RFC 2046,
         November 1996.

   [2]   Freed, N. and N. Borenstein, "Multipurpose Internet Mail
         Extensions (MIME) Part One: Format of Internet Message Bodies",
         RFC 2045, November 1996.

   [3]   Gellens, R., "The Text/Plain Format and DelSp Parameters",
         RFC 3676, February 2004.

   [4]   Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
         Resource Identifier (URI): Generic Syntax", RFC 3986,
         January 2005.

   [5]   Bradner, S., "Key words for use in RFCs to Indicate Requirement
         Levels", RFC 2119, March 1997.

   [6]   International Organization for Standardization, "Information
         technology - Portable Operating System Interface (POSIX) - Part
         2: Shell and Utilities", ISO 9945-2, 1993.

   [7]   Crocker, D. and P. Overell, "Augmented BNF for Syntax
         Specifications: ABNF", RFC 4234, October 2005.

   [8]   Yergeau, F., "UTF-8, a transformation format of ISO 10646",
         STD 63, RFC 3629, November 2003.

   [9]   Duerst, M. and M.
         Suignard, "Internationalized Resource
         Identifiers (IRI)", RFC 3987, January 2005.

   [10]  Rivest, R., "The MD5 Message-Digest Algorithm", RFC 1321,
         April 1992.

9.2.  Non-Normative References

   [11]  Connolly, D. and L. Masinter, "The 'text/html' Media Type",
         RFC 2854, June 2000.

   [12]  Freed, N. and J. Klensin, "Media Type Specifications and
         Registration Procedures", RFC 4288, December 2005.

   [13]  DeRose, S., Maler, E., and D. Orchard, "XML Linking Language
         (XLink) Version 1.0", W3C Recommendation REC-xlink-20010627,
         June 2001.

   [14]  Hoffman, P. and F. Yergeau, "UTF-16, an encoding of ISO 10646",
         RFC 2781, February 2000.

Appendix A.  Acknowledgements

   Thanks for comments and suggestions provided by Marcel Baschnagel,
   John Cowan, Benja Fallenstein, Sandro Hawke, Dan Kohn, Henrik
   Levkowetz, and Ted Hardie.

Authors' Addresses

   Erik Wilde
   UC Berkeley
   School of Information
   Berkeley, CA 94720-4600
   U.S.A.

   Phone: +1-510-6432253
   Email: net.dret@dret.net
   URI:   http://dret.net/netdret/

   Martin Duerst (Note: Please write "Duerst" with u-umlaut wherever
                 possible, for example as "Dürst" in XML and HTML.)
   Aoyama Gakuin University
   5-10-1 Fuchinobe
   Sagamihara, Kanagawa  229-8558
   Japan

   Phone: +81 42 759 6329
   Fax:   +81 42 759 6495
   Email: mailto:duerst@it.aoyama.ac.jp
   URI:   http://www.sw.it.aoyama.ac.jp/D%C3%BCrst/

Full Copyright Statement

   Copyright (C) The Internet Society (2007).

   This document is subject to the rights, licenses and restrictions
   contained in BCP 78, and except as set forth therein, the authors
   retain all their rights.

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Intellectual Property

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights.  Information
   on the procedures with respect to rights in RFC documents can be
   found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard.  Please address the information to the IETF at
   ietf-ipr@ietf.org.

Acknowledgment

   Funding for the RFC Editor function is provided by the IETF
   Administrative Support Activity (IASA).