idnits 2.17.1 draft-wilde-text-fragment-03.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** It looks like you're using RFC 3978 boilerplate. You should update this to the boilerplate described in the IETF Trust License Policy document (see https://trustee.ietf.org/license-info), which is required now. -- Found old boilerplate from RFC 3978, Section 5.1.a on line 16. -- Found old boilerplate from RFC 3978, Section 5.5 on line 792. -- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 769. -- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 776. -- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 782. ** The document seems to lack an RFC 3978 Section 5.1 IPR Disclosure Acknowledgement. ** This document has an original RFC 3978 Section 5.4 Copyright Line, instead of the newer IETF Trust Copyright according to RFC 4748. ** This document has an original RFC 3978 Section 5.5 Disclaimer, instead of the newer disclaimer which includes the IETF Trust according to RFC 4748. ** The document uses RFC 3667 boilerplate or RFC 3978-like boilerplate instead of verbatim RFC 3978 boilerplate. After 6 May 2005, submission of drafts without verbatim RFC 3978 boilerplate is not accepted. The following non-3978 patterns matched text found in the document. That text should be removed or replaced: This document is an Internet-Draft and is subject to all provisions of Section 3 of RFC 3667. By submitting this Internet-Draft, each author represents that any applicable patent or other IPR claims of which he or she is aware have been or will be disclosed, and any of which he or she becomes aware will be disclosed, in accordance with Section 6 of BCP 79. 
Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- == No 'Intended status' indicated for this document; assuming Proposed Standard Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack an IANA Considerations section. (See Section 2.2 of https://www.ietf.org/id-info/checklist for how to handle the case when there are no actions for IANA.) ** There are 2 instances of too long lines in the document, the longest one being 27 characters in excess of 72. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not match the current year == Line 103 has weird spacing: '...r allow forma...' == The document seems to lack the recommended RFC 2119 boilerplate, even if it appears to use RFC 2119 keywords. (The document does seem to have the reference to RFC 2119 which the ID-Checklist requires). -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (December 21, 2004) is 7058 days in the past. Is this intentional? Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) ** Obsolete normative reference: RFC 2396 (ref. 
'5') (Obsoleted by RFC 3986) -- Possible downref: Non-RFC (?) normative reference: ref. '6' ** Obsolete normative reference: RFC 2234 (ref. '7') (Obsoleted by RFC 4234) ** Downref: Normative reference to an Informational RFC: RFC 1321 (ref. '8') -- Obsolete informational reference (is this intentional?): RFC 2629 (ref. '13') (Obsoleted by RFC 7749) Summary: 10 errors (**), 0 flaws (~~), 4 warnings (==), 9 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Network Working Group E. Wilde 3 Internet-Draft ETH Zurich 4 Expires: June 21, 2005 December 21, 2004 6 URI Fragment Identifiers for the text/plain Media Type 7 draft-wilde-text-fragment-03 9 Status of this Memo 11 This document is an Internet-Draft and is subject to all provisions 12 of section 3 of RFC 3667. By submitting this Internet-Draft, each 13 author represents that any applicable patent or other IPR claims of 14 which he or she is aware have been or will be disclosed, and any of 15 which he or she become aware will be disclosed, in accordance with 16 RFC 3668. 18 Internet-Drafts are working documents of the Internet Engineering 19 Task Force (IETF), its areas, and its working groups. Note that 20 other groups may also distribute working documents as 21 Internet-Drafts. 23 Internet-Drafts are draft documents valid for a maximum of six months 24 and may be updated, replaced, or obsoleted by other documents at any 25 time. It is inappropriate to use Internet-Drafts as reference 26 material or to cite them other than as "work in progress." 28 The list of current Internet-Drafts can be accessed at 29 http://www.ietf.org/ietf/1id-abstracts.txt. 31 The list of Internet-Draft Shadow Directories can be accessed at 32 http://www.ietf.org/shadow.html. 34 This Internet-Draft will expire on June 21, 2005. 36 Copyright Notice 38 Copyright (C) The Internet Society (2004). 
40 Abstract 42 This memo defines URI fragment identifiers for text/plain MIME 43 entities. These fragment identifiers make it possible to refer to 44 parts of a text MIME entity, identified by character count or range, 45 line count or range, or a regular expression. These identification 46 methods can be combined to identify more than one sub-resource of a 47 text/plain MIME entity. Fragment identifiers may also contain hash 48 information to make them more robust.
50 Table of Contents
52 1. Introduction
53 1.1 What is text/plain?
54 1.1.1 Line Endings in text/plain MIME Entities
55 1.2 What is a URI Fragment Identifier?
56 1.3 Why text/plain Fragment Identifiers?
57 1.4 Incremental Deployment
58 2. Fragment Identification Methods
59 2.1 Fragment Identification Schemes
60 2.1.1 Principles
61 2.1.2 Combining the Principles
62 2.1.3 Regular Expressions
63 2.1.4 Combining Fragment Identification Scheme Parts
64 2.2 Fragment Identifier Robustness
65 3. Fragment Identification Syntax
66 3.1 Non-ASCII Characters in Regular Expressions
67 3.2 Hash Sums
68 4. Fragment Identifier Processing
69 4.1 Handling of position Values
70 4.2 Handling of Hash Sums
71 4.3 Syntax Errors in Fragment Identifiers
72 5. Examples
73 6. Security Considerations
74 7. Change Log
75 7.1 From -02 to -03
76 7.2 From -01 to -02
77 7.3 From -00 to -01
78 8. Open Issues
79 8.1 To Do
80 8.2 Open Questions
81 9. References
82 9.1 Normative References
83 9.2 Non-Normative References
84 Author's Address
85 A. POSIX BRE Syntax
86 B. Where to send Comments
87 C. Acknowledgements
88 Intellectual Property and Copyright Statements
90 1. Introduction 92 Compliant software MUST follow this specification. The capitalized 93 key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 94 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 95 document are to be interpreted as described in RFC 2119 [1]. 97 1.1 What is text/plain? 99 Internet Media Types as defined in RFC 2045 [2] and RFC 2046 [3] are 100 used to identify different types and sub-types of media. RFC 2046 101 [3] and RFC 3676 [4] specify the text/plain media type, which is used 102 for simple, unformatted text. Quoting from RFC 2046 [3]: "Plain text 103 does not provide for or allow formatting commands, font attribute 104 specifications, processing instructions, interpretation directives, 105 or content markup. Plain text is seen simply as a linear sequence of 106 characters, possibly interrupted by line breaks or page breaks."
108 The text/plain media type does not restrict the character encoding; 109 any character encoding may be used. In the absence of an explicit 110 character encoding declaration, US-ASCII is assumed as the default 111 character encoding. This variability of the character encoding makes 112 it impossible to count characters in a text/plain MIME entity without 113 taking the character encoding into account, because there are many 114 character encodings using more than one octet per character. 116 The biggest advantage of text/plain MIME entities is their ease of 117 use and their portability among different platforms. As long as they 118 use popular character encodings (such as US-ASCII), they can be 119 displayed and processed on virtually every computer system. 121 1.1.1 Line Endings in text/plain MIME Entities 123 RFC 2046 [3] and RFC 3676 [4] specify that line endings in text/plain 124 MIME entities are represented by CR+LF character sequences. In 125 implementation practice, however, text/plain MIME entities use 126 different conventions, for example depending on the operating system 127 they have been created with (in most cases, Unix uses LF, MacOS uses 128 CR, and Windows uses CR+LF). Because of this diversity of 129 conventions, implementations interpreting text/plain fragment 130 identifiers MUST take different line ending conventions into account. 132 Line endings in text/plain MIME entities MAY be represented by characters or character 133 sequences other than CR+LF, specifically CR, LF, NEL, and 134 CR+NEL. All these characters and sequences MUST be interpreted as line 135 endings. This interpretation MUST affect the evaluation of 136 text/plain fragment identifiers. All representations of line endings 137 (CR+LF, CR, LF, NEL, and CR+NEL) MUST be treated as a single 138 character in character counts. For the purpose of regular expression 139 matching, all representations of line endings MUST be treated as 140 single LF characters. 
The reason for this is that fragment 141 identifiers should not be broken by converting a file from one line 142 ending convention to another. 144 In general, the line ending conventions used in text/plain MIME 145 entities depend on the character encoding of the MIME entity. 146 Implementations SHOULD attempt to be as accurate as possible in 147 recognizing line endings specific to particular character encodings, 148 and MUST treat all these line endings as one character in character 149 counts, and as single LF characters for regular expression matching. 151 1.2 What is a URI Fragment Identifier? 153 URIs are the identification mechanism for resources on the Web. The 154 URI syntax specified in RFC 2396 [5] includes as part of a URI 155 reference a fragment identifier, which (quoting from RFC 2396 [5]) 156 "consists of additional reference information to be interpreted by 157 the user agent after the retrieval action has been successfully 158 completed. As such, it is not part of a URI, but is often used in 159 conjunction with a URI. The semantics of a fragment identifier is a 160 property of the data resulting from a retrieval action, regardless of 161 the type of URI used in the reference. Therefore, the format and 162 interpretation of fragment identifiers is dependent on the media type 163 of the retrieval result." 165 The most popular fragment identifier is defined for text/html 166 (defined in RFC 2854 [9]), and makes it possible to refer to a 167 specific element (identified by a 'name' or 'id' attribute) of an 168 HTML document. 170 1.3 Why text/plain Fragment Identifiers? 172 Referring to specific parts of a resource can be very useful, because 173 it enables users and applications to create more specific references. 174 Rather than pointing to a whole resource, users can create references 175 to the part they are really interested in or want to talk about. 
176 Even though it is suggested that fragment identification methods are 177 specified in a media type's MIME registration, many media types do 178 not have fragment identification methods associated with them. 180 Fragment identifiers are only useful if supported by the client, 181 because they are only interpreted by the client. Therefore, a new 182 fragment identification method will require some time to be adopted 183 by clients, and older clients will not support it. However, because 184 the URI reference still works even if the fragment identifier is not 185 supported (the resource is retrieved, but the fragment identifier is 186 not interpreted), rapid adoption is not highly critical to ensure the 187 success of a new fragment identification method. 189 Fragment identifiers for text/plain make it possible to refer to 190 specific parts of a text MIME entity, using concepts of positions and 191 ranges, which may be applied to characters and lines. They also 192 support locating a fragment by using a regular expression for 193 searching for a specific character sequence. Thus, text/plain 194 fragment identifiers enable users to exchange information more 195 specifically, thereby reducing the time and effort necessary to 196 manually search for the relevant part of a text/plain MIME entity. 198 1.4 Incremental Deployment 200 As long as support for text/plain fragment identifiers is not 201 implemented by all programs, it is important to consider the 202 implications of incremental deployment. Clients (for example, Web 203 browsers) not supporting the text/plain fragment identifier described 204 in this memo will work with URI references to text/plain MIME 205 entities, but they will fail to locate the sub-resource identified by 206 the fragment identifier. 
This is a reasonable fallback behavior, and 207 in general users should take into account the possibility that a 208 program interpreting a given URI reference will fail to interpret the 209 fragment identifier part. Since fragment identifier evaluation is 210 local to the client (and happens after retrieving the MIME entity), 211 there is no way for a server to determine whether a requesting client 212 is using a URI reference containing a fragment identifier. 214 2. Fragment Identification Methods 216 The identification of fragments of text/plain MIME entities can be 217 based on different foundations. Since it is not possible to insert 218 explicit, invisible identifiers into a text/plain MIME entity (as is done, for 219 example, in HTML documents through special 220 attributes), fragment identification has to rely on certain inherent 221 criteria of the MIME entity. This memo specifies fragment 222 identification using six different methods, which are character 223 positions and ranges, line positions and ranges, regular expression 224 matching, and a mechanism for improving the robustness of fragment 225 identifiers (entity hashes). 227 When interpreting character or line numbers, implementations MUST 228 take the character encoding of the MIME entity into account, because 229 character count and octet count may differ for the character encoding 230 being used. For example, a MIME entity using UTF-16 encoding (as 231 specified in RFC 2781 [10]) uses two or four octets per character, and it may 232 have a leading BOM (Byte-Order Mark), which does not count as a 233 character and thus also affects the mapping from a simple octet count 234 to a character count. 236 2.1 Fragment Identification Schemes 238 Fragment identification can be done using regular expressions or 239 combining two orthogonal principles, which are positions and ranges, 240 and characters and lines. 
The following sections describe the 241 principles themselves, while Section 2.1.2 describes the combination 242 of the principles. 244 2.1.1 Principles 246 2.1.1.1 Positions and Ranges 248 A position does not identify an actual fragment of the MIME entity, 249 but a position inside the MIME entity, which could be regarded as a 250 fragment of zero length. The use case for positions is to provide 251 pointers for applications which may use them to implement 252 functionalities such as "insert some text here", which needs a 253 position rather than a fragment. Positions are counted from zero 254 (position zero being before the first character or line of a 255 text/plain MIME entity), so that a text/plain MIME entity having one 256 character has two positions, one before the first character (position 257 0), and one after the first character (position 1). 259 Since positions are fragments of length zero, applications SHOULD use 260 other methods than highlighting to indicate positions, the most 261 obvious way being the positioning of a cursor (if the application 262 supports the concept of a cursor). 264 Ranges, on the other hand, identify fragments of a MIME entity that 265 have a length that may be greater than zero. As a general principle 266 for ranges, they specify both a lower and an upper bound. The start 267 or the end of a range specification may be omitted, defaulting to the 268 first or last position of the MIME entity, respectively. The ending 269 position of a range must have a value greater than or equal to the 270 starting position (consequently, a range with identical lower and upper 271 positions is legal, and identifies a range of length 0, which is 272 equivalent to a position). Counting for ranges uses positions, so 273 that a fragment containing one entity is specified by using a range 274 with two adjacent positions. 
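As an illustration of the counting model described above, the following Python sketch treats positions as gaps between characters, so that a range maps directly onto a half-open slice. The function name char_range and the sample string are assumptions made for this illustration only and are not defined by this memo. The handling of values beyond the end of the MIME entity is a matter for Section 4 and is deliberately left out here.

```python
def char_range(text, start=None, end=None):
    """Return the fragment between two character positions.

    Positions count gaps, not characters: position 0 lies before the
    first character and position len(text) lies after the last one.
    An omitted bound defaults to the first or last position.
    """
    if start is None:
        start = 0
    if end is None:
        end = len(text)
    if start > end:
        raise ValueError("upper bound must not be less than lower bound")
    return text[start:end]


sample = "abc"
positions = list(range(len(sample) + 1))  # n characters give n+1 positions
first_char = char_range(sample, 0, 1)     # two adjacent positions, one character
empty = char_range(sample, 2, 2)          # identical bounds, length zero
tail = char_range(sample, 1)              # omitted end runs to the last position
```

In this sketch a range with identical bounds yields an empty string, which mirrors the statement that such a range is equivalent to a position.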
276 Since ranges are fragments with a length greater than zero, 277 applications SHOULD use methods like highlighting to indicate ranges 278 (if the application supports the concept of highlighting). 280 For positions and ranges it is implicitly assumed that if a number is 281 greater than the actual number of elements in the MIME entity, then 282 it is referring to the last element of the MIME entity (see Section 4 283 for the processing model). 285 2.1.1.2 Characters and Lines 287 The concept of positions and ranges may be applied to characters and 288 lines. In both cases, positions indicate points between entities, 289 while ranges identify zero or more entities by indicating positions. 291 Character positions are numbered starting with zero (ignoring initial 292 BOM marks or similar concepts that are not part of the actual textual 293 content of a text/plain MIME entity), and counting each character 294 separately, with the exception of line endings, which are always 295 counted as one character (Section 1.1.1 describes how line endings 296 MUST be identified). 298 Line positions are numbered starting with zero (with line position 299 zero always being identical with character position zero), with 300 Section 1.1.1 describing how line endings MUST be identified. 301 Fragments identified by lines include the line endings, so 302 applications identifying line-based fragments MUST include the line 303 endings in the fragment identification they are using (eg, the 304 highlighted selection). If a MIME entity does not contain any line 305 endings, then it consists of a single (the first) line. 307 2.1.2 Combining the Principles 309 In the following sections, the principles described in the preceding 310 section (positions/ranges and characters/lines) are combined, 311 resulting in four use cases. 
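The line-ending rules of Section 1.1.1 and the line numbering of Section 2.1.1.2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the names LINE_ENDING, normalize, line_offsets and line_range are hypothetical, the input is assumed to be an already decoded character string, and only the five line-ending representations named in Section 1.1.1 are recognized.

```python
import re

# Line endings named in Section 1.1.1. The two-character forms come first
# so that CR+LF and CR+NEL are not split into two separate line endings.
LINE_ENDING = re.compile("\r\n|\r\x85|\r|\n|\x85")


def normalize(text):
    """Replace every line ending with one LF, so each counts as one character."""
    return LINE_ENDING.sub("\n", text)


def line_offsets(normalized):
    """Map each line position to a character position in normalized text."""
    offsets = [0]  # line position 0 equals character position 0
    for match in re.finditer("\n", normalized):
        offsets.append(match.end())
    if offsets[-1] != len(normalized):
        offsets.append(len(normalized))  # last line has no trailing line ending
    return offsets


def line_range(text, start, end):
    """Return the lines between two line positions, line endings included."""
    normalized = normalize(text)
    offsets = line_offsets(normalized)
    return normalized[offsets[start]:offsets[end]]
```

For instance, under these assumptions line_range("one\r\ntwo\rthree", 1, 2) selects the second line together with its line ending, and an entity without any line ending has exactly two line positions.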
313 2.1.2.1 Character Position 315 Using the char scheme followed by a single number, it is possible to 316 point to a character position (ie, a fragment of length zero between 317 two characters). Rather than identifying a fragment consisting of a 318 number of characters, this method identifies a position between two 319 characters (or before the first or after the last character). 320 Character position counting starts with 0, so the character position 321 before the first character of a text/plain MIME entity has the 322 character position 0, and a MIME entity containing n distinct 323 characters has n+1 distinct character positions, the last one having 324 the character position n. 326 2.1.2.2 Character Range 328 If it is necessary to identify a fragment of one or more characters 329 using character counting, this can be done by using a character 330 range, using the char scheme followed by a range specification. A 331 character range is a consecutive region of the MIME entity that 332 extends from the starting character position of the range to the 333 ending character position of the range. 335 2.1.2.3 Line Position 337 Using the line scheme followed by a single number, it is possible to 338 point to a line position (ie, a fragment of length zero between two 339 lines). Rather than identifying a fragment consisting of a number of 340 lines, this method identifies a position between two lines (or before 341 the first or after the last line). Line position counting starts 342 with 0, so the line position before the first line of a text/plain 343 MIME entity has the line position 0, and a MIME entity containing n 344 distinct lines has n+1 distinct line positions, the last one having 345 the line position n. 347 2.1.2.4 Line Range 349 If it is necessary to identify a fragment of one or more lines using 350 line counting, this can be done by using a line range, using the line 351 scheme followed by a range specification. 
A line range is a 352 consecutive region of the MIME entity that extends from the starting 353 line position of the range to the ending line position of the range. 355 2.1.3 Regular Expressions 357 A common problem with fragment identifiers is their robustness (to 358 changes in the MIME entity), and character and line counts can break 359 very easily. A more robust way of identifying a fragment is by 360 searching for a specific pattern (another way of making fragment 361 identifiers more robust is described in Section 2.2 about including 362 entity hash sums in the fragment identifier). Thus, it is possible 363 to use a Basic Regular Expression (BRE) as defined by ISO 9945-2 [6] 364 (the POSIX standard) as a fragment identifier (Appendix A contains a 365 short summary of the POSIX BRE syntax). 367 2.1.4 Combining Fragment Identification Scheme Parts 369 While in most cases only one fragment identification scheme part will 370 be used, it is possible to combine them. By simply concatenating 371 different fragment identification scheme parts, separated by a 372 semicolon, the whole fragment identifier refers to the union of all 373 fragments of the text/plain MIME entity identified by the individual 374 fragment identification scheme parts. This way, it is possible to 375 identify disjoint ranges, such as multiple line ranges. 377 It should be noticed that regular expressions by themselves may 378 identify disjoint fragments, which is true in any case where the 379 regular expression matches more than one occurrence in the MIME 380 entity. 382 Since disjoint fragments can be identified, implementations SHOULD 383 make sure that these fragments are appropriately marked, for example 384 by highlighting the fragment (rather than only scrolling to some 385 line, which only identifies a single position in the MIME entity). 386 If an implementation can not mark disjoint fragments, it MAY resort 387 to marking only the first of the disjoint fragments. 
However, the 388 exact method of how implementations deal with disjoint fragments 389 depends on the application and interface, and is beyond the scope of 390 this memo. 392 2.2 Fragment Identifier Robustness 394 While regular expressions (as described in Section 2.1.3) may make 395 fragment identifiers more robust than character or line counts, it is 396 still possible that modifications of the resource will break the 397 fragment identifier. If applications want to create more robust 398 fragment identifiers, they may do so by adding hash sums to fragment 399 identifiers. These hash sums are used to detect a change in the 400 resource, so that applications may warn users about the possibility 401 that a fragment identifier might have been broken by a modification 402 of the resource. Since fragment identifiers are interpreted by 403 clients, hash sums are defined on MIME entities rather than the 404 resource itself, and as such are specific to a certain representation 405 of the resource, in case of text/plain resources the character 406 encoding of MIME entity. 408 Hash sums may specify the character encoding that has been used when 409 creating the hash sums, and if such a specification is present, 410 clients MUST check whether the character encoding specified for the 411 hash sum and the character encoding of the retrieved MIME entity are 412 equal, and clients MUST NOT check the hash sum if these values 413 differ. 415 3. Fragment Identification Syntax 417 The syntax for the fragment identifiers is straightforward. The 418 syntax defines four schemes, 'char', 'line', 'match', and hash (which 419 can either be 'length' or 'md5'). The 'char' and 'line' can be used 420 in two different variants, either the position variant (with a single 421 number), or the range variant (with two comma-separated positions). 
422 The 'match' scheme has a regular expression as parameter, which must 423 be specified as a string with escaped semicolons (because the 424 semicolon is used to concatenate multiple fragment identification 425 scheme parts). The hash scheme can either use the 'length' or the 426 'md5' scheme to specify a hash value. 428 The following syntax definition uses ABNF as defined in RFC 2234 [7].
430 text-fragment = text-scheme 0*( ";" text-scheme) 0*( ";" hash-scheme)
431 text-scheme = ( char-scheme / line-scheme / match-scheme )
432 hash-scheme = ( length-scheme / md5-scheme ) [ "," charenc ]
433 char-scheme = "char=" ( position / range )
434 line-scheme = "line=" ( position / range )
435 match-scheme = "match=" regex
436 position = number
437 range = (position "," [ position ]) / ("," position )
438 number = 1*( DIGIT )
439 regex = StringWithEscapedSemicolon
440 length-scheme = "length=" number
441 md5-scheme = "md5=" md5-value
442 md5-value = 32( hexdigit )
443 hexdigit = (DIGIT / "a" / "A" / "b" / "B" / "c" / "C" / "d" / "D" / "e" / "E" / "f" / "F" )
444 charenc = StringWithEscapedSemicolon
446 The StringWithEscapedSemicolon is a string where all characters may 447 appear literally (except the characters which are required by the URI 448 syntax to be escaped), with the exception of a semicolon. A 449 semicolon that should be part of the regular expression must be 450 escaped with a leading backslash, and implementations MUST make sure 451 to properly interpret regular expressions, properly dereferencing all 452 escape mechanisms that apply (ie, URI encoding, semicolon escaping, 453 and BRE escaping, as well as any additional escaping that may be 454 present due to the context of the URI reference). 456 3.1 Non-ASCII Characters in Regular Expressions 458 RFC 2396 [5] does not define how to use non-ASCII characters in URIs. 459 Consequently, it is not possible to use non-ASCII characters in URIs 460 in a standardized and reliable way. 
However, work on 461 Internationalized Resource Identifiers (IRI) [11] is in progress, and 462 as soon as this work results in a published RFC, it will be possible 463 to use non-ASCII characters in regular expressions, using the 464 encoding defined by IRI. 466 3.2 Hash Sums 468 A hash sum can either specify a MIME entity's length, or its MD5 469 fingerprint. In both cases, it can optionally specify the character 470 encoding that was used when calculating the hash sum, so that 471 clients interpreting the fragment identifier may check whether they 472 are using the same character encoding for their calculations. The 473 length of a text/plain MIME entity is calculated by using the 474 principles defined in Section 2.1.1.2. The MD5 fingerprint of a 475 text/plain MIME entity is calculated by using the algorithm presented 476 in [8], encoding the result in 32 hexadecimal digits (using uppercase 477 or lowercase letters) as a representation of the 128 bits that are 478 the result of the MD5 algorithm. 480 4. Fragment Identifier Processing 482 4.1 Handling of position Values 484 If any position value (as a position or inside a range) is greater 485 than the value for the actual MIME entity, then it identifies the 486 last character or line position of the MIME entity. If the first 487 position value in a range is not present, then the range extends from 488 the start of the MIME entity. If the second position value in a 489 range is not present, then the range extends to the end of the MIME 490 entity. If a range scheme's positions are not properly ordered (ie, 491 the first number is greater than the second), then this scheme part MUST 492 be ignored. 
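The rules of Section 4.1 can be sketched in Python as follows. This is an illustration under stated assumptions rather than a normative algorithm: the names split_parts and resolve are hypothetical, a range is treated as misordered when its first value exceeds its second, the ordering check is applied to the values as written (before any clamping), and a semicolon is considered escaped whenever a backslash immediately precedes it, which ignores the case of an escaped backslash.

```python
import re

# "char=" or "line=" followed by a position or a range (see Section 3).
PART = re.compile(r"(char|line)=(\d*)(,?)(\d*)")


def split_parts(fragment):
    """Split a fragment identifier on semicolons not preceded by a backslash."""
    return re.split(r"(?<!\\);", fragment)


def resolve(part, count):
    """Resolve a char or line scheme part against an entity that has
    `count` characters or lines. Returns (start, end) positions, or None
    if the part is not a usable char or line scheme part."""
    m = PART.fullmatch(part)
    if m is None:
        return None
    first, comma, second = m.group(2), m.group(3), m.group(4)
    if not first and not second:
        return None  # neither bound given, not allowed by the syntax
    if not comma:
        pos = min(int(first), count)  # too large: last position
        return (pos, pos)
    if first and second and int(first) > int(second):
        return None  # misordered range: this scheme part is ignored
    start = min(int(first), count) if first else 0
    end = min(int(second), count) if second else count
    return (start, end)
```

With these assumptions, resolve("line=10,20", 15) gives (10, 15) and resolve("line=10,20", 5) gives (5, 5), that is, the position after the last line. How an application treats a part for which resolve returns None, and whether the whole fragment identifier is discarded, is governed by Section 4.3 and is not modelled here.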
494 4.2 Handling of Hash Sums 496 If a fragment identifier contains a hash sum, and a client retrieves 497 a MIME entity and detects that the hash sum has changed (observing 498 the character encoding specification, if present), then the client 499 MUST NOT interpret any other text/plain fragment identifier scheme 500 part. A client MAY signal this situation to the user. 502 4.3 Syntax Errors in Fragment Identifiers 504 If a fragment identifier contains a syntax error (i.e., does not 505 conform to the syntax specified in Section 3), then it MUST be 506 ignored by clients. Clients SHOULD NOT make any attempt to correct 507 or guess fragment identifiers. Syntax errors MAY be reported by 508 clients. 510 5. Examples 512 The following examples show some usages for the fragment identifiers 513 defined in this memo. 515 http://example.com/text.txt#char=100 517 This URI reference identifies the position after the 100th character 518 of the text.txt MIME entity. It should be noted that it is not clear 519 which octet(s) of the MIME entity this will be without retrieving the 520 MIME entity and thus knowing which character encoding it is using (in 521 case of HTTP, this information will be given in the response's 522 Content-Type header). If the MIME entity has fewer than 100 523 characters, the URI reference identifies the position after the MIME 524 entity's last character. 526 http://example.com/text.txt#line=10,20 528 This URI reference identifies lines 11 to 20 of the text.txt MIME 529 entity. If the MIME entity has fewer than 11 lines, it identifies 530 the position after the last line. If the MIME entity has fewer than 20 531 but at least 11 lines, it identifies the lines 11 to the last line of 532 the MIME entity. 534 http://example.com/text.txt#match=searchterm 536 This URI reference identifies all occurrences of the regular 537 expression 'searchterm' in the MIME entity, ie all occurrences of the 538 string 'searchterm'. 
If there is more than one occurrence, then this
539     URI reference identifies a disjoint fragment, consisting of all of
540     these occurrences. If there is no occurrence of the search term, the
541     URI reference does not identify a fragment.

543       http://example.com/text.txt#line=,1;match=searchterm

545     This URI reference identifies the first line and all occurrences of
546     the regular expression 'searchterm' in the MIME entity. If there is
547     an occurrence of 'searchterm' outside of the first line, then this
548     URI reference identifies a disjoint fragment.

550       http://example.com/text.txt#match=hello\;

552     This URI reference identifies all occurrences of the regular
553     expression 'hello;' in the MIME entity. The semicolon with the
554     leading backslash has to be interpreted as a literal semicolon inside
555     of the BRE, treating the '\;' as an escaped ';', so that the actual
556     regular expression is 'hello;'. If there is more than one occurrence
557     of this regular expression, then this URI reference identifies a
558     disjoint fragment, consisting of all of these occurrences.

560     ...

562     (more complex example...)

564  6. Security Considerations

566     Regular expression matching code is notoriously vulnerable to buffer
567     overflow security holes, so any implementation supporting text/plain
568     fragment identifiers SHOULD make sure that the code being used has
569     been tested against buffer overflow attacks.

571  7. Change Log

573  7.1 From -02 to -03

575     o Replaced most occurrences of 'resource' with 'MIME entity',
576       because the result of dereferencing a URI is not the resource
577       itself, but some MIME entity (in our case of type text/plain)
578       representing it. Thanks to Sandro Hawke for pointing this out.

580     o Moved Section 8 to the very back of the document.

582     o Added Section 4 to define the processing model for fragment
583       identifiers (moved Section 4.1 from Section 3 to Section 4).

585     o Added hash scheme to make fragment identifiers more robust
586       (Section 2.2).

588     o Changed IPR clause from RFC 2026 to RFC 3667 (updated version of
589       RFC 2026)

591  7.2 From -01 to -02

593     o Fundamental change in semantics: counts turn into positions
594       (between characters or lines), so in order to identify a character
595       or line, ranges must be used (which now use positions to specify
596       the upper and lower bounds of the range).

598     o Made the first value of a range optional as well, so that line=,5
599       is also legal, identifying everything from the start of the MIME
600       entity to the 5th line.

602     o Changed the syntax from parenthesis-style to a more traditional
603       style using equals-signs.

605  7.3 From -00 to -01

607     o Made the second count value of ranges optional, so that something
608       like line(10,) is legal and properly defined.

610     o Added non-normative reference to Internet draft about non-ASCII
611       characters in search strings.

613     o Added Section 1.4 about incremental deployment.

615     o Added more elaborate examples.

617     o Added text about regex buffer overflow problems in Section 6.

619     o Added Section 1.1.1 about line endings in text/plain resources.

621     o Added Section 8 to collect open issues regarding this memo (will
622       be deleted in final RFC text).

624  8. Open Issues

626     This section will not be part of the final RFC text; it serves as a
627     container to collect to-dos (Section 8.1) and open questions (Section
628     8.2) regarding this memo.

630  8.1 To Do

632     o Allow negative numbers for positions, which are interpreted as
633       counting backwards from the MIME entity's end.

635     o Provide more complex example(s).

637     o Provide short BRE syntax and description in Appendix A (by
638       inclusion or by reference).

640     o Add some text about the importance of having fragment
641       identification capabilities for out-of-line linking methods such
642       as XLink to Section 1.3.

644     o Watch IRI [11] development and update to latest version.

646  8.2 Open Questions

648     o Should regex ranges be allowed (i.e., a fragment ranging from one
649       regex match to another regex match)?

651     o Should a more sophisticated regex mechanism than BREs be used?

653     o Regexes by themselves may identify disjoint sub-resources. Should
654       there be a mechanism to say something like "the 5th appearance of
655       the following regex"? Or are users responsible for composing
656       regexes which do not need this kind of additional mechanism?

658     o Is the concatenation of scheme parts (Section 2.1.4) and its
659       semantics of joining the individual fragments a good thing? Or a
660       bad thing?

662     o Should there be more schemes? Or fewer?

664     o Is it necessary to mention that applications must be able to
665       transcode characters, because the text file and the fragment
666       identifier may use different character encodings? What about
667       character normalization? Should that be addressed or at least
668       mentioned as being out of scope?

670     o MD5 values are now specified as 32 hex digits. An alternative
671       would be the representation as specified by [12], which defines
672       base64 encoding for the 128 bits of the checksum. Should both
673       forms be allowed (hex and base64) or is one enough? If only one,
674       is hex the right choice?

676  9. References

678  9.1 Normative References

680     [1] Bradner, S., "Key words for use in RFCs to Indicate Requirement
681         Levels", RFC 2119, March 1997.

683     [2] Freed, N. and N. Borenstein, "Multipurpose Internet Mail
684         Extensions (MIME) Part One: Format of Internet Message Bodies",
685         RFC 2045, November 1996.

687     [3] Freed, N. and N. Borenstein, "Multipurpose Internet Mail
688         Extensions (MIME) Part Two: Media Types", RFC 2046, November
689         1996.

691     [4] Gellens, R., "The Text/Plain Format and DelSp Parameters", RFC
692         3676, February 2004.

694     [5] Berners-Lee, T., Fielding, R. and L. Masinter, "Uniform Resource
695         Identifiers (URI): Generic Syntax", RFC 2396, August 1998.

697     [6] International Organization for Standardization, "Information
698         technology - Portable Operating System Interface (POSIX) - Part
699         2: Shell and Utilities", ISO 9945-2, 1993.

701     [7] Crocker, D. and P. Overell, "Augmented BNF for Syntax
702         Specifications: ABNF", RFC 2234, November 1997.

704     [8] Rivest, R., "The MD5 Message-Digest Algorithm", RFC 1321, April
705         1992.

707  9.2 Non-Normative References

709     [9] Connolly, D. and L. Masinter, "The 'text/html' Media Type", RFC
710         2854, June 2000.

712     [10] Hoffman, P. and F. Yergeau, "UTF-16, an encoding of ISO 10646",
713          RFC 2781, February 2000.

715     [11] Duerst, M. and M. Suignard, "Internationalized Resource
716          Identifiers (IRI)", draft-duerst-iri-11 (work in progress),
717          November 2004.

719     [12] Myers, J. and M. Rose, "The Content-MD5 Header Field", RFC
720          1864, October 1995.

722     [13] Rose, M., "Writing I-Ds and RFCs using XML", RFC 2629, June
723          1999.

725  Author's Address

727     Erik Wilde
728     ETH Zurich
729     ETH-Zentrum
730     8092 Zurich
731     Switzerland

733     Phone: +41-1-6325132
734     EMail: net.dret@dret.net
735     URI: http://dret.net/netdret/

737  Appendix A. POSIX BRE Syntax

739     This section contains a short (and non-normative) summary of the
740     POSIX BRE syntax defined in ISO 9945-2 [6]. The definition of BRE
741     syntax in ISO 9945-2 [6] is the normative reference, and the
742     following summary is for informative purposes only.

744     (tbd - is there some rfc that could be referenced instead?)

746  Appendix B. Where to send Comments

748     Please send all comments and questions concerning this document to
749     Erik Wilde.

751  Appendix C. Acknowledgements

753     This document has been prepared using the IETF document DTD described
754     in RFC 2629 [13].

756     Thanks for comments and suggestions provided by Dan Kohn, John Cowan,
757     Benja Fallenstein, Henrik Levkowetz, Sandro Hawke, and Marcel
758     Baschnagel.

760  Intellectual Property Statement

762     The IETF takes no position regarding the validity or scope of any
763     Intellectual Property Rights or other rights that might be claimed to
764     pertain to the implementation or use of the technology described in
765     this document or the extent to which any license under such rights
766     might or might not be available; nor does it represent that it has
767     made any independent effort to identify any such rights. Information
768     on the procedures with respect to rights in RFC documents can be
769     found in BCP 78 and BCP 79.

771     Copies of IPR disclosures made to the IETF Secretariat and any
772     assurances of licenses to be made available, or the result of an
773     attempt made to obtain a general license or permission for the use of
774     such proprietary rights by implementers or users of this
775     specification can be obtained from the IETF on-line IPR repository at
776     http://www.ietf.org/ipr.

778     The IETF invites any interested party to bring to its attention any
779     copyrights, patents or patent applications, or other proprietary
780     rights that may cover technology that may be required to implement
781     this standard. Please address the information to the IETF at
782     ietf-ipr@ietf.org.

784  Disclaimer of Validity

786     This document and the information contained herein are provided on an
787     "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
788     OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
789     ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
790     INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
791     INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
792     WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

794  Copyright Statement

796     Copyright (C) The Internet Society (2004). This document is subject
797     to the rights, licenses and restrictions contained in BCP 78, and
798     except as set forth therein, the authors retain all their rights.

800  Acknowledgment

802     Funding for the RFC Editor function is currently provided by the
803     Internet Society.