idnits 2.17.1 draft-wilde-text-fragment-05.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** It looks like you're using RFC 3978 boilerplate. You should update this to the boilerplate described in the IETF Trust License Policy document (see https://trustee.ietf.org/license-info), which is required now. -- Found old boilerplate from RFC 3978, Section 5.1 on line 14. -- Found old boilerplate from RFC 3978, Section 5.5 on line 810. -- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 787. -- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 794. -- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 800. ** This document has an original RFC 3978 Section 5.4 Copyright Line, instead of the newer IETF Trust Copyright according to RFC 4748. ** This document has an original RFC 3978 Section 5.5 Disclaimer, instead of the newer disclaimer which includes the IETF Trust according to RFC 4748. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- == No 'Intended status' indicated for this document; assuming Proposed Standard Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack an IANA Considerations section. (See Section 2.2 of https://www.ietf.org/id-info/checklist for how to handle the case when there are no actions for IANA.) ** There are 2 instances of too long lines in the document, the longest one being 28 characters in excess of 72. 
Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not match the current year == The document seems to lack the recommended RFC 2119 boilerplate, even if it appears to use RFC 2119 keywords. (The document does seem to have the reference to RFC 2119 which the ID-Checklist requires). -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (Jan 6, 2006) is 6686 days in the past. Is this intentional? Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) -- Possible downref: Non-RFC (?) normative reference: ref. '6' ** Obsolete normative reference: RFC 4234 (ref. '7') (Obsoleted by RFC 5234) ** Downref: Normative reference to an Informational RFC: RFC 1321 (ref. '9') -- Obsolete informational reference (is this intentional?): RFC 2629 (ref. '13') (Obsoleted by RFC 7749) Summary: 7 errors (**), 0 flaws (~~), 3 warnings (==), 9 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Network Working Group E. 
Wilde 3 Internet-Draft ETH Zurich 4 Expires: July 10, 2006 Jan 6, 2006 6 URI Fragment Identifiers for the text/plain Media Type 7 draft-wilde-text-fragment-05 9 Status of this Memo 11 By submitting this Internet-Draft, each author represents that any 12 applicable patent or other IPR claims of which he or she is aware 13 have been or will be disclosed, and any of which he or she becomes 14 aware will be disclosed, in accordance with Section 6 of BCP 79. 16 Internet-Drafts are working documents of the Internet Engineering 17 Task Force (IETF), its areas, and its working groups. Note that 18 other groups may also distribute working documents as Internet- 19 Drafts. 21 Internet-Drafts are draft documents valid for a maximum of six months 22 and may be updated, replaced, or obsoleted by other documents at any 23 time. It is inappropriate to use Internet-Drafts as reference 24 material or to cite them other than as "work in progress." 26 The list of current Internet-Drafts can be accessed at 27 http://www.ietf.org/ietf/1id-abstracts.txt. 29 The list of Internet-Draft Shadow Directories can be accessed at 30 http://www.ietf.org/shadow.html. 32 This Internet-Draft will expire on July 10, 2006. 34 Copyright Notice 36 Copyright (C) The Internet Society (2006). 38 Abstract 40 This memo defines URI fragment identifiers for text/plain MIME 41 entities. These fragment identifiers make it possible to refer to 42 parts of a text MIME entity, identified by character count or range, 43 line count or range, or a regular expression. These identification 44 methods can be combined to identify more than one sub-resource of a 45 text/plain MIME entity. Fragment identifiers may also contain hash 46 information to make them more robust. 48 Table of Contents 50 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 51 1.1. What is text/plain? . . . . . . . . . . . . . . . . . . . 3 52 1.1.1. Line Endings in text/plain MIME Entities . . . . . . . 3 53 1.2. 
What is a URI Fragment Identifier? . . . . . . . . . . . . 4 54 1.3. Why text/plain Fragment Identifiers? . . . . . . . . . . . 4 55 1.4. Incremental Deployment . . . . . . . . . . . . . . . . . . 5 56 2. Fragment Identification Methods . . . . . . . . . . . . . . . 5 57 2.1. Fragment Identification Schemes . . . . . . . . . . . . . 6 58 2.1.1. Principles . . . . . . . . . . . . . . . . . . . . . . 6 59 2.1.2. Combining the Principles . . . . . . . . . . . . . . . 7 60 2.1.3. Regular Expressions . . . . . . . . . . . . . . . . . 8 61 2.1.4. Combining Fragment Identification Scheme Parts . . . . 9 62 2.2. Fragment Identifier Robustness . . . . . . . . . . . . . . 9 63 3. Fragment Identification Syntax . . . . . . . . . . . . . . . . 10 64 3.1. Non-ASCII Characters in Regular Expressions . . . . . . . 11 65 3.2. Hash Sums . . . . . . . . . . . . . . . . . . . . . . . . 11 66 4. Fragment Identifier Processing . . . . . . . . . . . . . . . . 11 67 4.1. Handling of position Values . . . . . . . . . . . . . . . 11 68 4.2. Handling of Hash Sums . . . . . . . . . . . . . . . . . . 12 69 4.3. Syntax Errors in Fragment Identifiers . . . . . . . . . . 12 70 5. Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 71 6. Security Considerations . . . . . . . . . . . . . . . . . . . 14 72 7. Change Log . . . . . . . . . . . . . . . . . . . . . . . . . . 14 73 7.1. From -04 to -05 . . . . . . . . . . . . . . . . . . . . . 14 74 7.2. From -03 to -04 . . . . . . . . . . . . . . . . . . . . . 14 75 7.3. From -02 to -03 . . . . . . . . . . . . . . . . . . . . . 15 76 7.4. From -01 to -02 . . . . . . . . . . . . . . . . . . . . . 15 77 7.5. From -00 to -01 . . . . . . . . . . . . . . . . . . . . . 15 78 8. References . . . . . . . . . . . . . . . . . . . . . . . . . . 16 79 8.1. Normative References . . . . . . . . . . . . . . . . . . . 16 80 8.2. Non-Normative References . . . . . . . . . . . . . . . . . 16 81 Appendix A. Where to send Comments . . . . . . . . . . . . . . 
. 17 82 Appendix B. Acknowledgements . . . . . . . . . . . . . . . . . . 17 83 Author's Address . . . . . . . . . . . . . . . . . . . . . . . . . 18 84 Intellectual Property and Copyright Statements . . . . . . . . . . 19 86 1. Introduction 88 Compliant software MUST follow this specification. The capitalized 89 key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 90 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 91 document are to be interpreted as described in RFC 2119 [1]. 93 1.1. What is text/plain? 95 Internet Media Types as defined in RFC 2045 [2] and RFC 2046 [3] are 96 used to identify different types and sub-types of media. RFC 2046 97 [3] and RFC 3676 [4] specify the text/plain media type, which is used 98 for simple, unformatted text. Quoting from RFC 2046 [3]: "Plain text 99 does not provide for or allow formatting commands, font attribute 100 specifications, processing instructions, interpretation directives, 101 or content markup. Plain text is seen simply as a linear sequence of 102 characters, possibly interrupted by line breaks or page breaks." 104 The text/plain media type does not restrict the character encoding; 105 any character encoding may be used. In the absence of an explicit 106 character encoding declaration, US-ASCII is assumed as the default 107 character encoding. This variability of the character encoding makes 108 it impossible to count characters in a text/plain MIME entity without 109 taking the character encoding into account, because there are many 110 character encodings using more than one octet per character. 112 The biggest advantage of text/plain MIME entities is their ease of 113 use and their portability among different platforms. As long as they 114 use popular character encodings (such as US-ASCII), they can be 115 displayed and processed on virtually every computer system. 117 1.1.1. 
Line Endings in text/plain MIME Entities 119 RFC 2046 [3] and RFC 3676 [4] specify that line endings in text/plain 120 MIME entities are represented by CR+LF character sequences. In 121 implementation practice, however, text/plain MIME entities use 122 different conventions, for example depending on the operating system 123 they have been created with (in most cases, Unix uses LF, MacOS uses 124 CR, and Windows uses CR+LF). Because of this diversity of 125 conventions, implementations interpreting text/plain fragment 126 identifiers MUST take different line ending conventions into account. 128 Line endings in text/plain MIME entities MAY be represented by other 129 character (sequences) than CR+LF, specifically CR, LF, NEL, and CR+ 130 NEL. All these character (sequences) MUST be interpreted as line 131 endings. This interpretation MUST affect the evaluation of text/ 132 plain fragment identifiers. All representations of line endings 133 (CR+LF, CR, LF, NEL, and CR+NEL) MUST be treated as a single 134 character in character counts. For the purpose of regular expression 135 matching, all representations of line endings MUST be treated as 136 single LF characters. The reason for this is that fragment 137 identifiers should not be broken by converting a file from one line 138 ending convention to another. 140 In general, the line ending conventions used in text/plain MIME 141 entities depend on the character encoding of the MIME entity. 142 Implementations SHOULD attempt to be as accurate as possible in 143 recognizing line endings specific to particular character encodings, 144 and MUST treat all these line endings as one character in character 145 counts, and as single LF characters for regular expression matching. 147 1.2. What is a URI Fragment Identifier? 149 URIs are the identification mechanism for resources on the Web. 
The 150 URI syntax specified in RFC 3986 [5] includes as part of a URI a 151 fragment identifier, which (quoting from RFC 3986 [5]) "consists of 152 additional reference information to be interpreted by the user agent 153 after the retrieval action has been successfully completed. As such, 154 it is not part of a URI, but is often used in conjunction with a URI. 155 The semantics of a fragment identifier is a property of the data 156 resulting from a retrieval action, regardless of the type of URI used 157 in the reference. Therefore, the format and interpretation of 158 fragment identifiers is dependent on the media type of the retrieval 159 result." 161 The most popular fragment identifier is defined for text/html 162 (defined in RFC 2854 [10]), and makes it possible to refer to a 163 specific element (identified by a 'name' or 'id' attribute) of an 164 HTML document. 166 1.3. Why text/plain Fragment Identifiers? 168 Referring to specific parts of a resource can be very useful, because 169 it enables users and applications to create more specific references. 170 Rather than pointing to a whole resource, users can create references 171 to the part they really are interested in or want to talk about. 172 Even though it is suggested that fragment identification methods are 173 specified in a media type's MIME registration, many media types do 174 not have fragment identification methods associated with them. 176 Fragment identifiers are only useful if supported by the client, 177 because they are only interpreted by the client. Therefore, a new 178 fragment identification method will require some time to be adopted 179 by clients, and older clients will not support it. However, because 180 the URI still works even if the fragment identifier is not supported 181 (the resource is retrieved, but the fragment identifier is not 182 interpreted), rapid adoption is not highly critical to ensure the 183 success of a new fragment identification method. 
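As a minimal sketch of why an unsupported fragment identifier does not break retrieval, the following Python snippet separates a URI reference into the part used for retrieval and the fragment that stays with the client. The helper name split_reference is hypothetical and not part of this memo; urldefrag is the standard-library function from urllib.parse.

```python
from urllib.parse import urldefrag


def split_reference(uri_reference: str) -> tuple[str, str]:
    """Split a URI reference into (URI used for retrieval, fragment kept by the client).

    Hypothetical helper for illustration only.
    """
    uri, fragment = urldefrag(uri_reference)
    return uri, fragment


uri, fragment = split_reference("http://example.com/text.txt#line=10,20")
print(uri)       # the part a client dereferences
print(fragment)  # interpreted locally, or ignored by clients without support
```

A client that does not know the text/plain fragment syntax can simply discard the second value and still display the retrieved entity.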
185 Fragment identifiers for text/plain make it possible to refer to 186 specific parts of a text MIME entity, using concepts of positions and 187 ranges, which may be applied to characters and lines. They also 188 support locating a fragment by using a regular expression for 189 searching for a specific character sequence. Thus, text/plain 190 fragment identifiers enable users to exchange information more 191 specifically, thereby reducing the time and effort necessary to 192 manually search for the relevant part of a text/plain MIME entity. 194 The text/plain format does not support the embedding of links, so in 195 normal environments, text/plain resources can only serve as targets 196 for links, and not as sources. However, when combining the text/ 197 plain fragment identifiers specified in this memo with out-of-line 198 linking mechanisms such as XLink [11], it is possible to "embed" link 199 sources into text/plain resources. Thus, the text/plain fragment 200 identifiers specified in this memo open a path for text/plain files 201 to become fully integrated resources in hypermedia systems such as 202 the Web. 204 1.4. Incremental Deployment 206 As long as support for text/plain fragment identifiers is not 207 implemented by all programs, it is important to consider the 208 implications of incremental deployment. Clients (for example, Web 209 browsers) not supporting the text/plain fragment identifier described 210 in this memo will work with URI references to text/plain MIME 211 entities, but they will fail to locate the sub-resource identified by 212 the fragment identifier. This is a reasonable fallback behavior, and 213 in general users should take into account the possibility that a 214 program interpreting a given URI will fail to interpret the fragment 215 identifier part. 
Since fragment identifier evaluation is local to 216 the client (and happens after retrieving the MIME entity), there is 217 no way for a server to determine whether a requesting client is using 218 a URI containing a fragment identifier. 220 2. Fragment Identification Methods 222 The identification of fragments of text/plain MIME entities can be 223 based on different foundations. Since it is not possible to insert 224 explicit, invisible identifiers into a text/plain MIME entity (as for 225 example used in HTML documents, implemented through special 226 attributes), fragment identification has to rely on certain inherent 227 criteria of the MIME entity. This memo specifies fragment 228 identification using six different methods, which are character 229 positions and ranges, line positions and ranges, regular expression 230 matching, and a mechanism for improving the robustness of fragment 231 identifiers (entity hashes). 233 When interpreting character or line numbers, implementations MUST 234 take the character encoding of the MIME entity into account, because 235 character count and octet count may differ for the character encoding 236 being used. For example, a MIME entity using UTF-16 encoding (as 237 specified in RFC 2781 [12]) uses two or four octets per character, and it may 238 have a leading BOM (Byte-Order Mark), which does not count as a 239 character and thus also affects the mapping from a simple octet count 240 to a character count. 242 2.1. Fragment Identification Schemes 244 Fragment identification can be done using regular expressions or 245 combining two orthogonal principles, which are positions and ranges, 246 and characters and lines. The following sections describe the 247 principles themselves, while Section 2.1.2 describes the combination 248 of the principles. 250 2.1.1. Principles 252 2.1.1.1. 
Positions and Ranges 254 A position does not identify an actual fragment of the MIME entity, 255 but a position inside the MIME entity, which could be regarded as a 256 fragment of zero length. The use case for positions is to provide 257 pointers for applications which may use them to implement 258 functionalities such as "insert some text here", which needs a 259 position rather than a fragment. Positions are counted from zero 260 (position zero being before the first character or line of a text/ 261 plain MIME entity), so that a text/plain MIME entity having one 262 character has two positions, one before the first character (position 263 0), and one after the first character (position 1). 265 Since positions are fragments of length zero, applications SHOULD use 266 other methods than highlighting to indicate positions, the most 267 obvious way being the positioning of a cursor (if the application 268 supports the concept of a cursor). 270 Ranges, on the other hand, identify fragments of a MIME entity that 271 have a length that may be greater than zero. As a general principle 272 for ranges, they specify both a lower and an upper bound. The start 273 or the end of a range specification may be omitted, defaulting to the 274 first or last position of the MIME entity, respectively. The ending 275 position of a range must have a value greater than or equal to the 276 starting position (consequently, a range with identical lower and upper 277 positions is legal, and identifies a range of length 0, which is 278 equivalent to a position). Counting for ranges uses positions, so 279 that a fragment containing one character or line is specified by using a range 280 with two adjacent positions. 282 Since ranges usually have a length greater than zero, 283 applications SHOULD use methods like highlighting to indicate ranges 284 (if the application supports the concept of highlighting). 
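The position and range model can be sketched in Python as follows. This is an illustration under stated assumptions, not a reference implementation: the function names (normalize_line_endings, char_range, line_range) are hypothetical, the input is assumed to be an already decoded string with any BOM removed, line endings are collapsed to LF as Section 1.1.1 requires, oversized values are clamped to the last position, and an improperly ordered range is ignored (returned as None) in the spirit of the processing model in Section 4.1.

```python
import re

# Line-ending forms listed in Section 1.1.1; two-character forms come first
# so that CR+LF and CR+NEL are each consumed as one line ending.
_LINE_ENDINGS = re.compile("\r\n|\r\x85|\r|\n|\x85")


def normalize_line_endings(text):
    """Replace every line-ending form with LF so each counts as one character."""
    return _LINE_ENDINGS.sub("\n", text)


def _clamp(start, end, last):
    """Apply defaults for omitted bounds and clamp oversized values to `last`."""
    lo = 0 if start is None else min(start, last)
    hi = last if end is None else min(end, last)
    return (lo, hi) if lo <= hi else None  # None: improperly ordered range


def char_range(text, start=None, end=None):
    """Return the characters between two character positions (counted from 0)."""
    text = normalize_line_endings(text)
    bounds = _clamp(start, end, len(text))
    return None if bounds is None else text[bounds[0]:bounds[1]]


def line_range(text, start=None, end=None):
    """Return the lines between two line positions, including their line endings."""
    text = normalize_line_endings(text)
    lines = re.findall(r"[^\n]*\n|[^\n]+", text)
    bounds = _clamp(start, end, len(lines))
    return None if bounds is None else "".join(lines[bounds[0]:bounds[1]])
```

For instance, char_range("abc", 1, 2) selects the single character between positions 1 and 2, and a range with identical bounds yields the empty string, which corresponds to a position.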
286 For positions and ranges it is implicitly assumed that if a number is 287 greater than the actual number of elements in the MIME entity, then 288 it is referring to the last element of the MIME entity (see Section 4 289 for the processing model). 291 2.1.1.2. Characters and Lines 293 The concept of positions and ranges may be applied to characters and 294 lines. In both cases, positions indicate points between entities, 295 while ranges identify zero or more entities by indicating positions. 297 Character positions are numbered starting with zero (ignoring initial 298 BOM marks or similar concepts that are not part of the actual textual 299 content of a text/plain MIME entity), and counting each character 300 separately, with the exception of line endings, which are always 301 counted as one character (Section 1.1.1 describes how line endings 302 MUST be identified). 304 Line positions are numbered starting with zero (with line position 305 zero always being identical with character position zero), with 306 Section 1.1.1 describing how line endings MUST be identified. 307 Fragments identified by lines include the line endings, so 308 applications identifying line-based fragments MUST include the line 309 endings in the fragment identification they are using (eg, the 310 highlighted selection). If a MIME entity does not contain any line 311 endings, then it consists of a single (the first) line. 313 2.1.2. Combining the Principles 315 In the following sections, the principles described in the preceding 316 section (positions/ranges and characters/lines) are combined, 317 resulting in four use cases. 319 2.1.2.1. Character Position 321 Using the char scheme followed by a single number, it is possible to 322 point to a character position (ie, a fragment of length zero between 323 two characters). 
Rather than identifying a fragment consisting of a 324 number of characters, this method identifies a position between two 325 characters (or before the first or after the last character). 326 Character position counting starts with 0, so the character position 327 before the first character of a text/plain MIME entity has the 328 character position 0, and a MIME entity containing n distinct 329 characters has n+1 distinct character positions, the last one having 330 the character position n. 332 2.1.2.2. Character Range 334 If it is necessary to identify a fragment of one or more characters 335 using character counting, this can be done by using a character 336 range, using the char scheme followed by a range specification. A 337 character range is a consecutive region of the MIME entity that 338 extends from the starting character position of the range to the 339 ending character position of the range. 341 2.1.2.3. Line Position 343 Using the line scheme followed by a single number, it is possible to 344 point to a line position (ie, a fragment of length zero between two 345 lines). Rather than identifying a fragment consisting of a number of 346 lines, this method identifies a position between two lines (or before 347 the first or after the last line). Line position counting starts 348 with 0, so the line position before the first line of a text/plain 349 MIME entity has the line position 0, and a MIME entity containing n 350 distinct lines has n+1 distinct line positions, the last one having 351 the line position n. 353 2.1.2.4. Line Range 355 If it is necessary to identify a fragment of one or more lines using 356 line counting, this can be done by using a line range, using the line 357 scheme followed by a range specification. A line range is a 358 consecutive region of the MIME entity that extends from the starting 359 line position of the range to the ending line position of the range. 361 2.1.3. 
Regular Expressions 363 A common problem with fragment identifiers is their robustness (to 364 changes in the MIME entity), and character and line counts can break 365 very easily. A more robust way of identifying a fragment is by 366 searching for a specific pattern (another way of making fragment 367 identifiers more robust is described in Section 2.2 about including 368 entity hash sums in the fragment identifier). Thus, it is possible 369 to use a Basic Regular Expression (BRE) as defined by ISO 9945-2 [6] 370 (the POSIX standard) as a fragment identifier. 372 2.1.4. Combining Fragment Identification Scheme Parts 374 While in most cases only one fragment identification scheme part will 375 be used, it is possible to combine them. By simply concatenating 376 different fragment identification scheme parts, separated by a 377 semicolon, the whole fragment identifier refers to the union of all 378 fragments of the text/plain MIME entity identified by the individual 379 fragment identification scheme parts. This way, it is possible to 380 identify disjoint ranges, such as multiple line ranges. 382 It should be noticed that regular expressions by themselves may 383 identify disjoint fragments, which is true in any case where the 384 regular expression matches more than one occurrence in the MIME 385 entity. 387 Since disjoint fragments can be identified, implementations SHOULD 388 make sure that these fragments are appropriately marked, for example 389 by highlighting the fragment (rather than only scrolling to some 390 line, which only identifies a single position in the MIME entity). 391 If an implementation can not mark disjoint fragments, it MAY resort 392 to marking only the first of the disjoint fragments. However, the 393 exact method of how implementations deal with disjoint fragments 394 depends on the application and interface, and is beyond the scope of 395 this memo. 397 2.2. 
Fragment Identifier Robustness 399 While regular expressions (as described in Section 2.1.3) may make 400 fragment identifiers more robust than character or line counts, it is 401 still possible that modifications of the resource will break the 402 fragment identifier. If applications want to create more robust 403 fragment identifiers, they may do so by adding hash sums to fragment 404 identifiers. These hash sums are used to detect a change in the 405 resource, so that applications may warn users about the possibility 406 that a fragment identifier might have been broken by a modification 407 of the resource. Since fragment identifiers are interpreted by 408 clients, hash sums are defined on MIME entities rather than the 409 resource itself, and as such are specific to a certain representation 410 of the resource, in the case of text/plain resources the character 411 encoding of the MIME entity. 413 Hash sums may specify the character encoding that has been used when 414 creating the hash sums, and if such a specification is present, 415 clients MUST check whether the character encoding specified for the 416 hash sum and the character encoding of the retrieved MIME entity are 417 equal, and clients MUST NOT check the hash sum if these values 418 differ. However, clients MAY choose to transcode the retrieved MIME 419 entity in the case of differing character encodings, and after doing 420 so, they MAY check the hash sum (please note that this method is 421 inherently unreliable, though, because certain characters or 422 character sequences may have been lost or normalized due to 423 restrictions of the coded character set). 425 3. Fragment Identification Syntax 427 The syntax for the fragment identifiers is straightforward. The 428 syntax defines four schemes, 'char', 'line', 'match', and hash (which 429 can either be 'length' or 'md5'). 
The 'char' and 'line' schemes can be used 430 in two different variants, either the position variant (with a single 431 number), or the range variant (with two comma-separated positions). 432 The 'match' scheme has a regular expression as parameter, which must 433 be specified as a string with escaped semicolons (because the 434 semicolon is used to concatenate multiple fragment identification 435 scheme parts). The hash scheme can either use the 'length' or the 436 'md5' scheme to specify a hash value. 438 The following syntax definition uses ABNF as defined in RFC 4234 [7].
440 text-fragment = text-scheme 0*( ";" text-scheme) 0*( ";" hash-scheme)
441 text-scheme   = ( char-scheme / line-scheme / match-scheme )
442 hash-scheme   = ( length-scheme / md5-scheme ) [ "," charenc ]
443 char-scheme   = "char=" ( position / range )
444 line-scheme   = "line=" ( position / range )
445 match-scheme  = "match=" regex
446 position      = number
447 range         = (position "," [ position ]) / ("," position )
448 number        = 1*( DIGIT )
449 regex         = StringWithEscapedSemicolon
450 length-scheme = "length=" number
451 md5-scheme    = "md5=" md5-value
452 md5-value     = 32( hexdigit )
453 hexdigit      = (DIGIT / "a" / "A" / "b" / "B" / "c" / "C" / "d" / "D" / "e" / "E" / "f" / "F" )
454 charenc       = StringWithEscapedSemicolon
456 The StringWithEscapedSemicolon is a string where all characters may 457 appear literally (except the characters which are required by the URI 458 syntax to be escaped), with the exception of a semicolon. A 459 semicolon that should be part of the regular expression must be 460 escaped with a leading backslash, and implementations MUST make sure 461 to properly interpret regular expressions, properly dereferencing all 462 escape mechanisms that apply (ie, URI encoding, semicolon escaping, 463 and BRE escaping, as well as any additional escaping that may be 464 present due to the context of the URI). 466 3.1. 
Non-ASCII Characters in Regular Expressions 468 RFC 3986 [5] only allows a subset of ASCII as characters in URIs. 469 Consequently, it is not possible to use non-ASCII characters in URIs. 470 However, using Internationalized Resource Identifiers (IRI) as 471 defined by RFC 3987 [8], it is possible to use non-ASCII characters, 472 using the encoding defined by IRI. Thus, using IRIs it is possible 473 to use non-ASCII characters in regular expressions, and 474 implementations MUST make sure to correctly handle any non-ASCII 475 characters in regular expressions, if they accept IRI-encoded text/ 476 plain fragment identifiers. 478 3.2. Hash Sums 480 A hash sum can either specify a MIME entity's length, or its MD5 481 fingerprint. In both cases, it can optionally specify the character 482 encoding that was used when calculating the hash sum, so that 483 clients interpreting the fragment identifier may check whether they 484 are using the same character encoding for their calculations. For 485 lengths, the character encoding is necessary because it may influence 486 the character count (for example, a combining a-umlaut character 487 which counts as two characters in Unicode will be collapsed to a 488 single a-umlaut character in ISO 8859 encoding). Using Unicode 489 terminology, this means that the length of a text/plain MIME entity 490 is computed based on its "code points" (other possibilities would 491 have included "code units", which depend on the encoding, and 492 "graphemes", which require knowledge about code point semantics). 493 For MD5 fingerprints, the character encoding is necessary because the 494 MD5 algorithm works on the binary representation of the text/plain 495 resource. 497 The length of a text/plain MIME entity is calculated by using the 498 principles defined in Section 2.1.1.2. 
The MD5 fingerprint of a 499 text/plain MIME entity is calculated by using the algorithm presented 500 in [9], encoding the result in 32 hexadecimal digits (using uppercase 501 or lowercase letters) as a representation of the 128 bits that are 502 the result of the MD5 algorithm. 504 4. Fragment Identifier Processing 506 4.1. Handling of position Values 508 If any position value (as a position or inside a range) is greater 509 than the value for the actual MIME entity, then it identifies the 510 last character or line position of the MIME entity. If the first 511 position value in a range is not present, then the range extends from 512 the start of the MIME entity. If the second position value in a 513 range is not present, then the range extends to the end of the MIME 514 entity. If a range scheme's positions are not properly ordered (ie, 515 the first number is greater than the second), then this scheme part MUST 516 be ignored. 518 4.2. Handling of Hash Sums 520 Clients are not required to implement the handling of hash sums, so 521 they MAY choose to ignore hash sum information altogether. However, 522 if they do implement hash sum handling, they MUST implement it as 523 follows: 525 If a fragment identifier contains a hash sum, and a client retrieves 526 a MIME entity and detects that the hash sum has changed (observing 527 the character encoding specification as described in Section 3.2, if 528 present), then the client SHOULD NOT interpret any other text/plain 529 fragment identifier scheme part. A client MAY signal this situation 530 to the user. 532 4.3. Syntax Errors in Fragment Identifiers 534 If a fragment identifier contains a syntax error (i.e., does not 535 conform to the syntax specified in Section 3), then it MUST be 536 ignored by clients. Clients SHOULD NOT make any attempt to correct 537 or guess fragment identifiers. Syntax errors MAY be reported by 538 clients. 540 5. 
Examples 542 The following examples show some usages for the fragment identifiers 543 defined in this memo. 545 http://example.com/text.txt#char=100 547 This URI identifies the position after the 100th character of the 548 text.txt MIME entity. It should be noted that it is not clear which 549 octet(s) of the MIME entity this will be without retrieving the MIME 550 entity and thus knowing which character encoding is it using (in case 551 of HTTP, this information will be given in the response's Content- 552 type header). If the MIME entity has fewer than 100 characters, the 553 URI identifies the position after the MIME entity's last character. 555 http://example.com/text.txt#line=10,20 557 This URI identifies lines 11 to 20 of the text.txt MIME entity. If 558 the MIME entity has fewer than 11 lines, it identifies the position 559 after last line. If the MIME entity has less than 20 but at least 11 560 lines, it identifies the lines 11 to the last line of the MIME 561 entity. 563 http://example.com/text.txt#match=searchterm 565 This URI identifies all occurrences of the regular expression 566 'searchterm' in the MIME entity, ie all occurrences of the string 567 'searchterm'. If there is more than one occurrence, then this URI 568 identifies a disjoint fragment, consisting of all of these 569 occurrences. If there is no occurrence of the search term, the URI 570 does not identify a fragment. 572 http://example.com/text.txt#line=,1;match=searchterm 574 This URI identifies the first line and all occurrences of the regular 575 expression 'searchterm' in the MIME entity. If there is an 576 occurrence of 'searchterm' outside of the first line, then this URI 577 identifies a disjoint fragment. 579 http://example.com/text.txt#match=hello\; 581 This URI identifies all occurrences of the regular expression 582 'hello;' in the MIME entity. 
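   The scheme parts in the examples above are separated by semicolons,
   while the escaped semicolon in this last URI has to stay inside the
   regular expression.  One way to honor both is sketched here, assuming
   Python and a hypothetical helper named split_scheme_parts; it treats
   '\;' as the only escape and does not consider an escaped backslash
   directly before a semicolon.

```python
import re


def split_scheme_parts(fragment: str) -> list[tuple[str, str]]:
    """Split a fragment identifier into (scheme name, value) pairs.

    Only semicolons that are not preceded by a backslash separate
    scheme parts; '\\;' inside a value is turned into a literal ';'.
    """
    parts = re.split(r"(?<!\\);", fragment)
    result = []
    for part in parts:
        name, _, value = part.partition("=")
        result.append((name, value.replace("\\;", ";")))
    return result


print(split_scheme_parts("line=,1;match=searchterm"))
print(split_scheme_parts("match=hello\\;"))
```

   The first call yields two scheme parts, the second a single match
   part whose regular expression is 'hello;'.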
   The semicolon with the leading backslash has to be interpreted as a
   literal semicolon inside of the BRE, treating the '\;' as an escaped
   ';', so that the actual regular expression is 'hello;'.  If there is
   more than one occurrence of this regular expression, then this URI
   identifies a disjoint fragment, consisting of all of these
   occurrences.

      http://example.com/text.txt#line=10,20;length=9876,UTF-8

   As in the line example above, this URI identifies lines 11 to 20 of
   the text.txt MIME entity.  The additional length hash sum specifies
   that the MIME entity has a length of 9876 code points when encoded in
   UTF-8.  If the client supports the length hash sum scheme, it may
   test the retrieved MIME entity for its length, but only if the
   retrieved MIME entity uses the UTF-8 encoding or has been locally
   transcoded into this encoding.  If the length of the retrieved MIME
   entity does not match the length specified in the fragment
   identifier, the client SHOULD NOT interpret the line part and MAY
   signal this to the user.

6.  Security Considerations

   Regular expression matching code is notoriously vulnerable to buffer
   overflow security holes, so any implementation supporting text/plain
   fragment identifiers SHOULD make sure that the code being used has
   been tested against buffer overflow attacks.

7.  Change Log

   This section will not be part of the final RFC text; it serves as a
   container for collecting the history of the individual draft
   versions.

7.1.  From -04 to -05

   o  Added some explanatory text to the last paragraph of Section 2.2.

   o  Added a paragraph about the importance of having fragment
      identification capabilities for out-of-line linking methods such
      as XLink to Section 1.3.

   o  Added explanation of why the charset is important for length hash
      sums to Section 3.2.

   o  Added text that makes hash sum handling optional and allows
      clients to interpret fragment identifiers even if the hash sum did
      not match (changed MUST NOT to SHOULD NOT) to Section 4.2.

   o  Added example using a length hash sum in Section 5.

   o  RFC 2234 (ABNF) has been obsoleted by [7].

   o  Removed the "Open Issues" section for preparation of final draft
      before submission as RFC.

7.2.  From -03 to -04

   o  URIs are now defined by RFC 3986 [5], so the text and the
      references have been updated.  In particular, RFC 3986 defines a
      fragment identifier to be part of the URI, whereas in the
      obsoleted RFC 2396 URI specification, it was not part of a URI as
      such, but of a "URI reference".

   o  IRIs are now defined by RFC 3987 [8], so the text and the
      references have been updated.

   o  Changed IPR clause from RFC 3667 to RFC 3978 (updated version of
      RFC 3667).

7.3.  From -02 to -03

   o  Replaced most occurrences of 'resource' with 'MIME entity',
      because the result of dereferencing a URI is not the resource
      itself, but some MIME entity (in our case of type text/plain)
      representing it.  Thanks to Sandro Hawke for pointing this out.

   o  Moved "Open Issues" to the very back of the document.

   o  Added Section 4 to define the processing model for fragment
      identifiers (moved Section 4.1 from Section 3 to Section 4).

   o  Added hash scheme to make fragment identifiers more robust
      (Section 2.2).

   o  Changed IPR clause from RFC 2026 to RFC 3667 (updated version of
      RFC 2026).

7.4.  From -01 to -02

   o  Fundamental change in semantics: counts turn into positions
      (between characters or lines), so in order to identify a character
      or line, ranges must be used (which now use positions to specify
      the upper and lower bounds of the range).

   o  Made the first value of a range optional as well, so that line=,5
      is also legal, identifying everything from the start of the MIME
      entity to the 5th line.

   o  Changed the syntax from parenthesis-style to a more traditional
      style using equals signs.

7.5.  From -00 to -01

   o  Made the second count value of ranges optional, so that something
      like line(10,) is legal and properly defined.

   o  Added non-normative reference to Internet draft about non-ASCII
      characters in search strings.

   o  Added Section 1.4 about incremental deployment.

   o  Added more elaborate examples.

   o  Added text about regex buffer overflow problems in Section 6.

   o  Added Section 1.1.1 about line endings in text/plain resources.

   o  Added "Open Issues" to collect open issues regarding this memo
      (will be deleted in final RFC text).

8.  References

8.1.  Normative References

   [1]   Bradner, S., "Key words for use in RFCs to Indicate Requirement
         Levels", RFC 2119, March 1997.

   [2]   Freed, N. and N. Borenstein, "Multipurpose Internet Mail
         Extensions (MIME) Part One: Format of Internet Message Bodies",
         RFC 2045, November 1996.

   [3]   Freed, N. and N. Borenstein, "Multipurpose Internet Mail
         Extensions (MIME) Part Two: Media Types", RFC 2046,
         November 1996.

   [4]   Gellens, R., "The Text/Plain Format and DelSp Parameters",
         RFC 3676, February 2004.

   [5]   Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
         Resource Identifier (URI): Generic Syntax", RFC 3986,
         January 2005.

   [6]   International Organization for Standardization, "Information
         technology - Portable Operating System Interface (POSIX) - Part
         2: Shell and Utilities", ISO 9945-2, 1993.

   [7]   Crocker, D. and P. Overell, "Augmented BNF for Syntax
         Specifications: ABNF", RFC 4234, October 2005.

   [8]   Duerst, M. and M. Suignard, "Internationalized Resource
         Identifiers (IRI)", RFC 3987, January 2005.

   [9]   Rivest, R., "The MD5 Message-Digest Algorithm", RFC 1321,
         April 1992.

8.2.  Non-Normative References

   [10]  Connolly, D. and L. Masinter, "The 'text/html' Media Type",
         RFC 2854, June 2000.

   [11]  DeRose, S., Maler, E., and D. Orchard, "XML Linking Language
         (XLink) Version 1.0", W3C Recommendation REC-xlink-20010627,
         June 2001.

   [12]  Hoffman, P. and F. Yergeau, "UTF-16, an encoding of ISO 10646",
         RFC 2781, February 2000.

   [13]  Rose, M., "Writing I-Ds and RFCs using XML", RFC 2629,
         June 1999.

Appendix A.  Where to send Comments

   Please send all comments and questions concerning this document to
   Erik Wilde.

Appendix B.  Acknowledgements

   This document has been prepared using the IETF document DTD described
   in RFC 2629 [13].

   Thanks for comments and suggestions provided by Marcel Baschnagel,
   John Cowan, Martin Duerst, Benja Fallenstein, Sandro Hawke, Dan Kohn,
   and Henrik Levkowetz.

Author's Address

   Erik Wilde
   ETH Zurich
   ETH-Zentrum
   8092 Zurich
   Switzerland

   Phone: +41-44-6325132
   Email: net.dret@dret.net
   URI:   http://dret.net/netdret/

Intellectual Property Statement

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights.  Information
   on the procedures with respect to rights in RFC documents can be
   found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard.  Please address the information to the IETF at
   ietf-ipr@ietf.org.

Disclaimer of Validity

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Copyright Statement

   Copyright (C) The Internet Society (2006).  This document is subject
   to the rights, licenses and restrictions contained in BCP 78, and
   except as set forth therein, the authors retain all their rights.

Acknowledgment

   Funding for the RFC Editor function is currently provided by the
   Internet Society.