idnits 2.17.1 

draft-wilde-text-fragment-03.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** It looks like you're using RFC 3978 boilerplate.  You should update this
     to the boilerplate described in the IETF Trust License Policy document
     (see https://trustee.ietf.org/license-info), which is required now.

  -- Found old boilerplate from RFC 3978, Section 5.1.a on line 16.

  -- Found old boilerplate from RFC 3978, Section 5.5 on line 792.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 769.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 776.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 782.

  ** The document seems to lack an RFC 3978 Section 5.1 IPR Disclosure
     Acknowledgement. 

  ** This document has an original RFC 3978 Section 5.4 Copyright Line,
     instead of the newer IETF Trust Copyright according to RFC 4748.

  ** This document has an original RFC 3978 Section 5.5 Disclaimer, instead
     of the newer disclaimer which includes the IETF Trust according to RFC
     4748.

  ** The document uses RFC 3667 boilerplate or RFC 3978-like boilerplate
     instead of verbatim RFC 3978 boilerplate.  After 6 May 2005, submission
     of drafts without verbatim RFC 3978 boilerplate is not accepted.

     The following non-3978 patterns matched text found in the document. 
     That text should be removed or replaced:

        This document is an Internet-Draft and is subject to all provisions of
        Section 3 of RFC 3667.

        By submitting this Internet-Draft, each author represents that any
        applicable patent or other IPR claims of which he or she is aware
        have been or will be disclosed, and any of which he or she
        becomes aware will be disclosed, in accordance with Section 6 of
        BCP 79.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an IANA Considerations section.  (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  ** There are 2 instances of too long lines in the document, the longest one
     being 27 characters in excess of 72.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not
     match the current year

  == Line 103 has weird spacing: '...r allow  forma...'

  == The document seems to lack the recommended RFC 2119 boilerplate, even if
     it appears to use RFC 2119 keywords. 

     (The document does seem to have the reference to RFC 2119 which the
     ID-Checklist requires).
  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (December 21, 2004) is 7058 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  ** Obsolete normative reference: RFC 2396 (ref. '5') (Obsoleted by RFC 3986)

  -- Possible downref: Non-RFC (?) normative reference: ref. '6'

  ** Obsolete normative reference: RFC 2234 (ref. '7') (Obsoleted by RFC 4234)

  ** Downref: Normative reference to an Informational RFC: RFC 1321 (ref. '8')

  -- Obsolete informational reference (is this intentional?): RFC 2629 (ref.
     '13') (Obsoleted by RFC 7749)


     Summary: 10 errors (**), 0 flaws (~~), 4 warnings (==), 9 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Network Working Group                                           E. Wilde
3	Internet-Draft                                                ETH Zurich
4	Expires: June 21, 2005                                 December 21, 2004

6	         URI Fragment Identifiers for the text/plain Media Type
7	                      draft-wilde-text-fragment-03

9	Status of this Memo

11	   This document is an Internet-Draft and is subject to all provisions
12	   of section 3 of RFC 3667.  By submitting this Internet-Draft, each
13	   author represents that any applicable patent or other IPR claims of
14	   which he or she is aware have been or will be disclosed, and any of
15	   which he or she becomes aware will be disclosed, in accordance with
16	   RFC 3668.

18	   Internet-Drafts are working documents of the Internet Engineering
19	   Task Force (IETF), its areas, and its working groups.  Note that
20	   other groups may also distribute working documents as
21	   Internet-Drafts.

23	   Internet-Drafts are draft documents valid for a maximum of six months
24	   and may be updated, replaced, or obsoleted by other documents at any
25	   time.  It is inappropriate to use Internet-Drafts as reference
26	   material or to cite them other than as "work in progress."

28	   The list of current Internet-Drafts can be accessed at
29	   http://www.ietf.org/ietf/1id-abstracts.txt.

31	   The list of Internet-Draft Shadow Directories can be accessed at
32	   http://www.ietf.org/shadow.html.

34	   This Internet-Draft will expire on June 21, 2005.

36	Copyright Notice

38	   Copyright (C) The Internet Society (2004).

40	Abstract

42	   This memo defines URI fragment identifiers for text/plain MIME
43	   entities.  These fragment identifiers make it possible to refer to
44	   parts of a text MIME entity, identified by character count or range,
45	   line count or range, or a regular expression.  These identification
46	   methods can be combined to identify more than one sub-resource of a
47	   text/plain MIME entity.  Fragment identifiers may also contain hash
48	   information to make them more robust.

50	Table of Contents

52	   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  3
53	     1.1   What is text/plain?  . . . . . . . . . . . . . . . . . . .  3
54	       1.1.1   Line Endings in text/plain MIME Entities . . . . . . .  3
55	     1.2   What is a URI Fragment Identifier? . . . . . . . . . . . .  4
56	     1.3   Why text/plain Fragment Identifiers? . . . . . . . . . . .  4
57	     1.4   Incremental Deployment . . . . . . . . . . . . . . . . . .  5
58	   2.  Fragment Identification Methods  . . . . . . . . . . . . . . .  5
59	     2.1   Fragment Identification Schemes  . . . . . . . . . . . . .  6
60	       2.1.1   Principles . . . . . . . . . . . . . . . . . . . . . .  6
61	       2.1.2   Combining the Principles . . . . . . . . . . . . . . .  7
62	       2.1.3   Regular Expressions  . . . . . . . . . . . . . . . . .  8
63	       2.1.4   Combining Fragment Identification Scheme Parts . . . .  8
64	     2.2   Fragment Identifier Robustness . . . . . . . . . . . . . .  9
65	   3.  Fragment Identification Syntax . . . . . . . . . . . . . . . .  9
66	     3.1   Non-ASCII Characters in Regular Expressions  . . . . . . . 10
67	     3.2   Hash Sums  . . . . . . . . . . . . . . . . . . . . . . . . 10
68	   4.  Fragment Identifier Processing . . . . . . . . . . . . . . . . 11
69	     4.1   Handling of position Values  . . . . . . . . . . . . . . . 11
70	     4.2   Handling of Hash Sums  . . . . . . . . . . . . . . . . . . 11
71	     4.3   Syntax Errors in Fragment Identifiers  . . . . . . . . . . 11
72	   5.  Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
73	   6.  Security Considerations  . . . . . . . . . . . . . . . . . . . 13
74	   7.  Change Log . . . . . . . . . . . . . . . . . . . . . . . . . . 13
75	     7.1   From -02 to -03  . . . . . . . . . . . . . . . . . . . . . 13
76	     7.2   From -01 to -02  . . . . . . . . . . . . . . . . . . . . . 13
77	     7.3   From -00 to -01  . . . . . . . . . . . . . . . . . . . . . 13
78	   8.  Open Issues  . . . . . . . . . . . . . . . . . . . . . . . . . 14
79	     8.1   To Do  . . . . . . . . . . . . . . . . . . . . . . . . . . 14
80	     8.2   Open Questions . . . . . . . . . . . . . . . . . . . . . . 14
81	   9.  References . . . . . . . . . . . . . . . . . . . . . . . . . . 15
82	   9.1   Normative References . . . . . . . . . . . . . . . . . . . . 15
83	   9.2   Non-Normative References . . . . . . . . . . . . . . . . . . 16
84	       Author's Address . . . . . . . . . . . . . . . . . . . . . . . 16
85	   A.  POSIX BRE Syntax . . . . . . . . . . . . . . . . . . . . . . . 16
86	   B.  Where to send Comments . . . . . . . . . . . . . . . . . . . . 16
87	   C.  Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 17
88	       Intellectual Property and Copyright Statements . . . . . . . . 18

90	1.  Introduction

92	   Compliant software MUST follow this specification.  The capitalized
93	   key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
94	   "SHOULD", "SHOULD NOT", "RECOMMENDED",  "MAY", and "OPTIONAL" in this
95	   document are to be interpreted as described in RFC 2119 [1].

97	1.1  What is text/plain?

99	   Internet Media Types as defined in RFC 2045 [2] and RFC 2046 [3] are
100	   used to identify different types and sub-types of media.  RFC 2046
101	   [3] and RFC 3676 [4] specify the text/plain media type, which is used
102	   for simple, unformatted text.  Quoting from RFC 2046 [3]: "Plain text
103	   does not provide for or allow  formatting commands, font attribute
104	   specifications, processing instructions, interpretation directives,
105	   or content markup.  Plain text is seen simply as a linear sequence of
106	   characters, possibly interrupted by line breaks or page breaks."

108	   The text/plain media type does not restrict the character encoding;
109	   any character encoding may be used.  In the absence of an explicit
110	   character encoding declaration, US-ASCII is assumed as the default
111	   character encoding.  This variability of the character encoding makes
112	   it impossible to count characters in a text/plain MIME entity without
113	   taking the character encoding into account, because there are many
114	   character encodings using more than one octet per character.

116	   The biggest advantage of text/plain MIME entities is their ease of
117	   use and their portability among different platforms.  As long as they
118	   use popular character encodings (such as US-ASCII), they can be
119	   displayed and processed on virtually every computer system.

121	1.1.1  Line Endings in text/plain MIME Entities

123	   RFC 2046 [3] and RFC 3676 [4] specify that line endings in text/plain
124	   MIME entities are represented by CR+LF character sequences.  In
125	   implementation practice, however, text/plain MIME entities use
126	   different conventions, for example depending on the operating system
127	   they have been created with (in most cases, Unix uses LF, MacOS uses
128	   CR, and Windows uses CR+LF).  Because of this diversity of
129	   conventions, implementations interpreting text/plain fragment
130	   identifiers MUST take different line ending conventions into account.

132	   Line endings in text/plain MIME entities MAY be represented by other
133	   character (sequences) than CR+LF, specifically CR, LF, NEL, and
134	   CR+NEL.  All these character (sequences) MUST be interpreted as line
135	   endings.  This interpretation MUST affect the evaluation of
136	   text/plain fragment identifiers.  All representations of line endings
137	   (CR+LF, CR, LF, NEL, and CR+NEL) MUST be treated as a single
138	   character in character counts.  For the purpose of regular expression
139	   matching, all representations of line endings MUST be treated as
140	   single LF characters.  The reason for this is that fragment
141	   identifiers should not be broken by converting a file from one line
142	   ending convention to another.

144	   In general, the line ending conventions used in text/plain MIME
145	   entities depend on the character encoding of the MIME entity.
146	   Implementations SHOULD attempt to be as accurate as possible in
147	   recognizing line endings specific to particular character encodings,
148	   and MUST treat all these line endings as one character in character
149	   counts, and single LF characters for regular expression matching.
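
   As a rough illustration of the normalization rule in this section,
   the following Python sketch collapses every recognized line ending
   into a single LF before characters are counted or regular expressions
   are matched.  The function name is hypothetical, and the sketch
   assumes that the MIME entity has already been decoded into a
   character string.

```python
import re

# CR+LF and CR+NEL are listed before the single-character forms so that a
# two-character sequence is never counted as two separate line endings.
_LINE_ENDING = re.compile("\r\n|\r\u0085|\r|\n|\u0085")


def normalize_line_endings(text: str) -> str:
    """Replace every recognized line ending (CR+LF, CR+NEL, CR, LF, NEL)
    with a single LF character."""
    return _LINE_ENDING.sub("\n", text)
```

   After this step, each line ending contributes exactly one character
   to a character count, regardless of the convention used by the
   system that produced the entity.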

151	1.2  What is a URI Fragment Identifier?

153	   URIs are the identification mechanism for resources on the Web.  The
154	   URI syntax specified in RFC 2396 [5] includes as part of a URI
155	   reference a fragment identifier, which (quoting from RFC 2396 [5])
156	   "consists of additional reference information to be interpreted by
157	   the user agent after the retrieval action has been successfully
158	   completed.  As such, it is not part of a URI, but is often used in
159	   conjunction with a URI.  The semantics of a fragment identifier is a
160	   property of the data resulting from a retrieval action, regardless of
161	   the type of URI used in the reference.  Therefore, the format and
162	   interpretation of fragment identifiers is dependent on the media type
163	   of the retrieval result."

165	   The most popular fragment identifier is defined for text/html
166	   (defined in RFC 2854 [9]), and makes it possible to refer to a
167	   specific element (identified by a 'name' or 'id' attribute) of an
168	   HTML document.

170	1.3  Why text/plain Fragment Identifiers?

172	   Referring to specific parts of a resource can be very useful, because
173	   it enables users and applications to create more specific references.
174	   Rather than pointing to a whole resource, users can create references
175	   to the part they really are interested in or want to talk about.
176	   Even though it is suggested that fragment identification methods are
177	   specified in a media type's MIME registration, many media types do
178	   not have fragment identification methods associated with them.

180	   Fragment identifiers are only useful if supported by the client,
181	   because they are only interpreted by the client.  Therefore, a new
182	   fragment identification method will require some time to be adopted
183	   by clients, and older clients will not support it.  However, because
184	   the URI reference still works even if the fragment identifier is not
185	   supported (the resource is retrieved, but the fragment identifier is
186	   not interpreted), rapid adoption is not highly critical to ensure the
187	   success of a new fragment identification method.

189	   Fragment identifiers for text/plain make it possible to refer to
190	   specific parts of a text MIME entity, using concepts of positions and
191	   ranges, which may be applied to characters and lines.  They also
192	   support locating a fragment by using a regular expression for
193	   searching for a specific character sequence.  Thus, text/plain
194	   fragment identifiers enable users to exchange information more
195	   specifically, thereby reducing time and effort that is necessary to
196	   manually search for the relevant part of a text/plain MIME entity.

198	1.4  Incremental Deployment

200	   As long as support for text/plain fragment identifiers is not
201	   implemented by all programs, it is important to consider the
202	   implications of incremental deployment.  Clients (for example, Web
203	   browsers) not supporting the text/plain fragment identifier described
204	   in this memo will work with URI references to text/plain MIME
205	   entities, but they will fail to locate the sub-resource identified by
206	   the fragment identifier.  This is a reasonable fallback behavior, and
207	   in general users should take into account the possibility that a
208	   program interpreting a given URI reference will fail to interpret the
209	   fragment identifier part.  Since fragment identifier evaluation is
210	   local to the client (and happens after retrieving the MIME entity),
211	   there is no way for a server to determine whether a requesting client
212	   is using a URI reference containing a fragment identifier.

214	2.  Fragment Identification Methods

216	   The identification of fragments of text/plain MIME entities can be
217	   based on different foundations.  Since it is not possible to insert
218	   explicit, invisible identifiers into a text/plain MIME entity (as for
219	   example used in HTML documents, implemented through special
220	   attributes), fragment identification has to rely on certain inherent
221	   criteria of the MIME entity.  This memo specifies fragment
222	   identification using six different methods, which are character
223	   positions and ranges, line positions and ranges, regular expression
224	   matching, and a mechanism for improving the robustness of fragment
225	   identifiers (entity hashes).

227	   When interpreting character or line numbers, implementations MUST
228	   take the character encoding of the MIME entity into account, because
229	   character count and octet count may differ for the character encoding
230	   being used.  For example, a MIME entity using UTF-16 encoding (as
231	   specified in RFC 2781 [10]) uses at least two octets per character,
232	   and it may have a leading BOM (Byte-Order Mark), which does not count
233	   as a character and thus also affects the mapping from a simple octet
234	   count to a character count.
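
   The difference between octet count and character count can be seen
   in a small Python sketch.  The helper name decode_entity is
   hypothetical, and the sketch assumes that the character encoding
   label has been taken from the Content-Type of the retrieved entity.

```python
def decode_entity(octets: bytes, charset: str = "us-ascii") -> str:
    """Decode the octets of a text/plain entity into characters.

    A leading BOM is not part of the textual content, so it is dropped
    if the codec has left it in place (Python's "utf-16" codec already
    consumes it, while "utf-8" does not).
    """
    text = octets.decode(charset)
    if text.startswith("\ufeff"):
        text = text[1:]
    return text


# Five characters occupy twelve octets in UTF-16 with a BOM.
octets = "h\u00e9llo".encode("utf-16")
characters = decode_entity(octets, "utf-16")
```

   Character positions are then counted on the decoded string, never on
   the raw octets.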

236	2.1  Fragment Identification Schemes

238	   Fragment identification can be done using regular expressions or
239	   combining two orthogonal principles, which are positions and ranges,
240	   and characters and lines.  The following section describes the
241	   principles themselves, while Section 2.1.2 describes the combination
242	   of the principles.

244	2.1.1  Principles

246	2.1.1.1  Positions and Ranges

248	   A position does not identify an actual fragment of the MIME entity,
249	   but a position inside the MIME entity, which could be regarded as a
250	   fragment of zero length.  The use case for positions is to provide
251	   pointers for applications which may use them to implement
252	   functionalities such as "insert some text here", which needs a
253	   position rather than a fragment.  Positions are counted from zero
254	   (position zero being before the first character or line of a
255	   text/plain MIME entity), so that a text/plain MIME entity having one
256	   character has two positions, one before the first character (position
257	   0), and one after the first character (position 1).

259	   Since positions are fragments of length zero, applications SHOULD use
260	   other methods than highlighting to indicate positions, the most
261	   obvious way being the positioning of a cursor (if the application
262	   supports the concept of a cursor).

264	   Ranges, on the other hand, identify fragments of a MIME entity that
265	   have a length that may be greater than zero.  As a general principle
266	   for ranges, they specify both a lower and an upper bound.  The start
267	   or the end of a range specification may be omitted, defaulting to the
268	   first or last position of the MIME entity, respectively.  The ending
269	   position of a range must have a value greater than or equal to the
270	   lower position (consequently, a range with identical lower and upper
271	   positions is legal, and identifies a range of length 0, which is
272	   equivalent to a position).  Counting for ranges uses positions, so
273	   that a fragment containing one character or line is specified by a
274	   range with two adjacent positions.

276	   Since ranges are fragments that may have a length greater than zero,
277	   applications SHOULD use methods like highlighting to indicate ranges
278	   (if the application supports the concept of highlighting).

280	   For positions and ranges it is implicitly assumed that if a number is
281	   greater than the actual number of elements in the MIME entity, then
282	   it is referring to the last element of the MIME entity (see Section 4
283	   for the processing model).
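
   Positions behave much like slice indices in many programming
   languages, which the following Python sketch tries to make concrete.
   The function names are hypothetical; clamping of out-of-range values
   is left to the processing model and is not shown here.

```python
def position_count(entity: str) -> int:
    """Number of distinct positions: one before each element, plus one
    after the last element."""
    return len(entity) + 1


def fragment(entity: str, lower: int, upper: int) -> str:
    """Return the elements lying between two positions.

    Assumes 0 <= lower <= upper <= len(entity).
    """
    return entity[lower:upper]
```

   Under this reading, a one-character entity has two positions, a range
   with two adjacent positions selects exactly one element, and a range
   with identical positions selects nothing.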

285	2.1.1.2  Characters and Lines

287	   The concept of positions and ranges may be applied to characters and
288	   lines.  In both cases, positions indicate points between entities,
289	   while ranges identify zero or more entities by indicating positions.

291	   Character positions are numbered starting with zero (ignoring initial
292	   BOM marks or similar concepts that are not part of the actual textual
293	   content of a text/plain MIME entity), and counting each character
294	   separately, with the exception of line endings, which are always
295	   counted as one character (Section 1.1.1 describes how line endings
296	   MUST be identified).

298	   Line positions are numbered starting with zero (with line position
299	   zero always being identical with character position zero), with
300	   Section 1.1.1 describing how line endings MUST be identified.
301	   Fragments identified by lines include the line endings, so
302	   applications identifying line-based fragments MUST include the line
303	   endings in the fragment identification they are using (eg, the
304	   highlighted selection).  If a MIME entity does not contain any line
305	   endings, then it consists of a single (the first) line.
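
   One possible way to map line positions onto character positions is
   sketched below in Python.  The function names are hypothetical.  The
   sketch assumes that line endings have already been normalized to LF,
   and it assumes that a trailing line ending does not start a further
   (empty) line, which this memo does not state explicitly.

```python
from typing import List


def line_offsets(text: str) -> List[int]:
    """Character position of every line position in LF-normalized text."""
    offsets = [0]
    for index, char in enumerate(text):
        if char == "\n":
            # The next line position lies directly after the line ending.
            offsets.append(index + 1)
    if not text.endswith("\n"):
        # The last line has no line ending; it stops at the entity's end.
        offsets.append(len(text))
    return offsets


def line_fragment(text: str, lower: int, upper: int) -> str:
    """Lines between two line positions, including their line endings."""
    offsets = line_offsets(text)
    last = len(offsets) - 1
    return text[offsets[min(lower, last)]:offsets[min(upper, last)]]
```

   Because the slice ends at the start of the following line, the
   selected fragment includes the line ending of its last line.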

307	2.1.2  Combining the Principles

309	   In the following sections, the principles described in the preceding
310	   section (positions/ranges and characters/lines) are combined,
311	   resulting in four use cases.

313	2.1.2.1  Character Position

315	   Using the char scheme followed by a single number, it is possible to
316	   point to a character position (ie, a fragment of length zero between
317	   two characters).  Rather than identifying a fragment consisting of a
318	   number of characters, this method identifies a position between two
319	   characters (or before the first or after the last character).
320	   Character position counting starts with 0, so the character position
321	   before the first character of a text/plain MIME entity has the
322	   character position 0, and a MIME entity containing n distinct
323	   characters has n+1 distinct character positions, the last one having
324	   the character position n.

326	2.1.2.2  Character Range

328	   If it is necessary to identify a fragment of one or more characters
329	   using character counting, this can be done by using a character
330	   range, using the char scheme followed by a range specification.  A
331	   character range is a consecutive region of the MIME entity that
332	   extends from the starting character position of the range to the
333	   ending character position of the range.

335	2.1.2.3  Line Position

337	   Using the line scheme followed by a single number, it is possible to
338	   point to a line position (ie, a fragment of length zero between two
339	   lines).  Rather than identifying a fragment consisting of a number of
340	   lines, this method identifies a position between two lines (or before
341	   the first or after the last line).  Line position counting starts
342	   with 0, so the line position before the first line of a text/plain
343	   MIME entity has the line position 0, and a MIME entity containing n
344	   distinct lines has n+1 distinct line positions, the last one having
345	   the line position n.

347	2.1.2.4  Line Range

349	   If it is necessary to identify a fragment of one or more lines using
350	   line counting, this can be done by using a line range, using the line
351	   scheme followed by a range specification.  A line range is a
352	   consecutive region of the MIME entity that extends from the starting
353	   line position of the range to the ending line position of the range.

355	2.1.3  Regular Expressions

357	   A common problem with fragment identifiers is their robustness (to
358	   changes in the MIME entity), and character and line counts can break
359	   very easily.  A more robust way of identifying a fragment is by
360	   searching for a specific pattern (another way of making fragment
361	   identifiers more robust is described in Section 2.2 about including
362	   entity hash sums in the fragment identifier).  Thus, it is possible
363	   to use a Basic Regular Expression (BRE) as defined by ISO 9945-2 [6]
364	   (the POSIX standard) as a fragment identifier (Appendix A contains a
365	   short summary of the POSIX BRE syntax).

367	2.1.4  Combining Fragment Identification Scheme Parts

369	   While in most cases only one fragment identification scheme part will
370	   be used, it is possible to combine them.  By simply concatenating
371	   different fragment identification scheme parts, separated by a
372	   semicolon, the whole fragment identifier refers to the union of all
373	   fragments of the text/plain MIME entity identified by the individual
374	   fragment identification scheme parts.  This way, it is possible to
375	   identify disjoint ranges, such as multiple line ranges.
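
   The union of several scheme parts could be computed as in the
   following Python sketch.  It assumes that each scheme part has
   already been resolved to a pair of character positions; the function
   name is hypothetical.

```python
from typing import Iterable, List, Tuple

Range = Tuple[int, int]


def union(ranges: Iterable[Range]) -> List[Range]:
    """Merge overlapping or adjacent ranges into disjoint fragments."""
    merged: List[Range] = []
    for lower, upper in sorted(ranges):
        if merged and lower <= merged[-1][1]:
            # Overlaps or touches the previous fragment: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], upper))
        else:
            merged.append((lower, upper))
    return merged
```

   The result is a sorted list of disjoint fragments, which an
   application could then mark one by one.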

377	   It should be noted that regular expressions by themselves may
378	   identify disjoint fragments, which is true in any case where the
379	   regular expression matches more than one occurrence in the MIME
380	   entity.

382	   Since disjoint fragments can be identified, implementations SHOULD
383	   make sure that these fragments are appropriately marked, for example
384	   by highlighting the fragment (rather than only scrolling to some
385	   line, which only identifies a single position in the MIME entity).
386	   If an implementation cannot mark disjoint fragments, it MAY resort
387	   to marking only the first of the disjoint fragments.  However, the
388	   exact method of how implementations deal with disjoint fragments
389	   depends on the application and interface, and is beyond the scope of
390	   this memo.

392	2.2  Fragment Identifier Robustness

394	   While regular expressions (as described in Section 2.1.3) may make
395	   fragment identifiers more robust than character or line counts, it is
396	   still possible that modifications of the resource will break the
397	   fragment identifier.  If applications want to create more robust
398	   fragment identifiers, they may do so by adding hash sums to fragment
399	   identifiers.  These hash sums are used to detect a change in the
400	   resource, so that applications may warn users about the possibility
401	   that a fragment identifier might have been broken by a modification
402	   of the resource.  Since fragment identifiers are interpreted by
403	   clients, hash sums are defined on MIME entities rather than the
404	   resource itself, and as such are specific to a certain representation
405	   of the resource, which for text/plain includes the character
406	   encoding of the MIME entity.

408	   Hash sums may specify the character encoding that has been used when
409	   creating the hash sums, and if such a specification is present,
410	   clients MUST check whether the character encoding specified for the
411	   hash sum and the character encoding of the retrieved MIME entity are
412	   equal, and clients MUST NOT check the hash sum if these values
413	   differ.

415	3.  Fragment Identification Syntax

417	   The syntax for the fragment identifiers is straightforward.  The
418	   syntax defines four schemes: 'char', 'line', 'match', and hash (which
419	   can be either 'length' or 'md5').  The 'char' and 'line' schemes can
420	   be used in two variants: the position variant (with a single number)
421	   or the range variant (with two comma-separated positions, one of
422	   which may be omitted).  The 'match' scheme takes a regular expression
423	   as its parameter, which must be specified as a string with escaped
424	   semicolons (because the semicolon is used to concatenate multiple
425	   fragment identification scheme parts).  The hash scheme uses either
426	   the 'length' or the 'md5' scheme to specify a hash value.

428	   The following syntax definition uses ABNF as defined in RFC 2234 [7].

430	   text-fragment =  text-scheme 0*( ";" text-scheme) 0*( ";" hash-scheme)
431	   text-scheme   =  ( char-scheme / line-scheme / match-scheme )
432	   hash-scheme   =  ( length-scheme / md5-scheme ) [ "," charenc ]
433	   char-scheme   =  "char=" ( position / range )
434	   line-scheme   =  "line=" ( position / range )
435	   match-scheme  =  "match=" regex
436	   position      =  number
437	   range         =  (position "," [ position ]) / ("," position )
438	   number        =  1*( DIGIT )
439	   regex         =  StringWithEscapedSemicolon
440	   length-scheme =  "length=" number
441	   md5-scheme    =  "md5=" md5-value
442	   md5-value     = 32( hexdigit )
443	   hexdigit      = (DIGIT / "a" / "A" / "b" / "B" / "c" / "C" / "d" / "D" / "e" / "E" / "f" / "F" )
444	   charenc       = StringWithEscapedSemicolon

446	   The StringWithEscapedSemicolon is a string where all characters may
447	   appear literally (except the characters which are required by the URI
448	   syntax to be escaped), with the exception of a semicolon.  A
449	   semicolon that should be part of the regular expression must be
450	   escaped with a leading backslash, and implementations MUST make sure
451	   to properly interpret regular expressions, properly dereferencing all
452	   escape mechanisms that apply (ie, URI encoding, semicolon escaping,
453	   and BRE escaping, as well as any additional escaping that may be
454	   present due to the context of the URI reference).
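
   A minimal Python sketch of the first parsing steps is given below:
   splitting a fragment identifier at unescaped semicolons, and reading
   the value of a 'char' or 'line' scheme.  The function names are
   hypothetical.  The sketch assumes that URI percent-decoding has
   already been applied, and it leaves BRE escapes untouched for the
   regular expression engine.

```python
import re
from typing import List, Optional, Tuple


def split_parts(fragment: str) -> List[str]:
    """Split at unescaped semicolons; a backslash-semicolon pair becomes
    a literal semicolon, other backslash pairs pass through unchanged."""
    parts, current, i = [], [], 0
    while i < len(fragment):
        char = fragment[i]
        if char == "\\" and i + 1 < len(fragment):
            pair = fragment[i:i + 2]
            current.append(";" if pair == "\\;" else pair)
            i += 2
        elif char == ";":
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(char)
            i += 1
    parts.append("".join(current))
    return parts


def parse_bounds(value: str) -> Tuple[Optional[int], Optional[int]]:
    """Parse 'position' or 'range'; None stands for an omitted bound."""
    match = re.fullmatch(r"([0-9]*)(,?)([0-9]*)", value)
    if match is None or not (match.group(1) or match.group(3)):
        raise ValueError("not a position or range: %r" % value)
    lower = int(match.group(1)) if match.group(1) else None
    if not match.group(2):
        return lower, lower  # a position is a range of length zero
    upper = int(match.group(3)) if match.group(3) else None
    return lower, upper
```

   A value that matches neither alternative raises ValueError, which a
   client would treat as a syntax error in the fragment identifier.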

456	3.1  Non-ASCII Characters in Regular Expressions

458	   RFC 2396 [5] does not define how to use non-ASCII characters in URIs.
459	   Consequently, it is not possible to use non-ASCII characters in URIs
460	   in a standardized and reliable way.  However, work on
461	   Internationalized Resource Identifiers (IRI) [11] is in progress, and
462	   as soon as this work results in a published RFC, it will be possible
463	   to use non-ASCII characters in regular expressions, using the
464	   encoding defined by IRI.

466	3.2  Hash Sums

468	   A hash sum can either specify a MIME entity's length, or its MD5
469	   fingerprint.  In both cases, it can optionally specify the character
470	   encoding that was used when calculating the hash sum, so that
471	   clients interpreting the fragment identifier may check whether they
472	   are using the same character encoding for their calculations.  The
473	   length of a text/plain MIME entity is calculated by using the
474	   principles defined in Section 2.1.1.2.  The MD5 fingerprint of a
475	   text/plain MIME entity is calculated by using the algorithm presented
476	   in [8], encoding the result in 32 hexadecimal digits (using uppercase
477	   or lowercase letters) as a representation of the 128 bits that are
478	   the result of the MD5 algorithm.
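
   The MD5 value could be produced and compared as in the following
   Python sketch, which uses the standard hashlib module.  The function
   names are hypothetical, and the sketch assumes that the fingerprint
   is computed over the octets of the MIME entity as retrieved, which
   this memo does not spell out.

```python
import hashlib


def md5_value(octets: bytes) -> str:
    """128-bit MD5 fingerprint as 32 lowercase hexadecimal digits."""
    return hashlib.md5(octets).hexdigest()


def md5_matches(octets: bytes, expected: str) -> bool:
    """Compare case-insensitively, since the syntax allows uppercase and
    lowercase hexadecimal digits."""
    return md5_value(octets) == expected.lower()
```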

480	4.  Fragment Identifier Processing

482	4.1  Handling of position Values

484	   If any position value (as a position or inside a range) is greater
485	   than the value for the actual MIME entity, then it identifies the
486	   last character or line position of the MIME entity.  If the first
487	   position value in a range is not present, then the range extends from
488	   the start of the MIME entity.  If the second position value in a
489	   range is not present, then the range extends to the end of the MIME
490	   entity.  If a range scheme's positions are not properly ordered
491	   (i.e., the first number is greater than the second), then this
492	   scheme part MUST be ignored.
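
   These rules might be implemented along the following lines.  This is
   a sketch only: resolve_range is a hypothetical helper, `size` stands
   for the number of characters or lines in the MIME entity, a missing
   position is represented by None, and an improperly ordered range is
   assumed to be one whose first position is greater than its second.

```python
def resolve_range(first, second, size):
    """Return (start, end) positions, or None if the scheme part is ignored.

    `first` and `second` are position values or None when absent;
    `size` is the entity's length in characters or lines.
    """
    if first is not None and second is not None and first > second:
        return None                       # improperly ordered: ignore the part
    start = 0 if first is None else min(first, size)    # default: start of entity
    end = size if second is None else min(second, size)  # default: end of entity
    return (start, end)
```

   With a 15-line entity, resolve_range(10, 20, 15) returns (10, 15),
   and resolve_range(20, 10, 15) returns None.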

494	4.2  Handling of Hash Sums

496	   If a fragment identifier contains a hash sum, and a client retrieves
497	   a MIME entity and detects that the hash sum has changed (observing
498	   the character encoding specification, if present), then the client
499	   MUST NOT interpret any other text/plain fragment identifier scheme
500	   part.  A client MAY signal this situation to the user.
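
   A client could apply this rule as sketched below.  The helper names
   and the keyword arguments `length` and `md5` are hypothetical, and
   the sketch assumes the same calculation conventions as the client
   uses elsewhere (characters for the length, octets in the stated
   character encoding for MD5).

```python
import hashlib


def hash_sum_unchanged(text, charset="utf-8", length=None, md5=None):
    """Check the retrieved text against the hash sum in the fragment."""
    if length is not None and len(text) != length:
        return False
    if md5 is not None:
        actual = hashlib.md5(text.encode(charset)).hexdigest()
        if actual != md5.lower():  # hex digits may be upper or lower case
            return False
    return True


def usable_scheme_parts(parts, text, **hash_sum):
    """Return the scheme parts that may be interpreted.

    If the hash sum no longer matches, no other scheme part is interpreted.
    """
    if hash_sum_unchanged(text, **hash_sum):
        return parts
    return []
```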

502	4.3  Syntax Errors in Fragment Identifiers

504	   If a fragment identifier contains a syntax error (i.e., does not
505	   conform to the syntax specified in Section 3), then it MUST be
506	   ignored by clients.  Clients SHOULD NOT make any attempt to correct
507	   or guess fragment identifiers.  Syntax errors MAY be reported by
508	   clients.

510	5.  Examples

512	   The following examples show some usages for the fragment identifiers
513	   defined in this memo.

515	   http://example.com/text.txt#char=100

517	   This URI reference identifies the position after the 100th character
518	   of the text.txt MIME entity.  It should be noted that it is not clear
519	   which octet(s) of the MIME entity this will be without retrieving the
520	   MIME entity and thus knowing which character encoding it is using (in
521	   the case of HTTP, this information will be given in the response's
522	   Content-Type header).  If the MIME entity has fewer than 100
523	   characters, the URI reference identifies the position after the MIME
524	   entity's last character.

526	   http://example.com/text.txt#line=10,20

528	   This URI reference identifies lines 11 to 20 of the text.txt MIME
529	   entity.  If the MIME entity has fewer than 11 lines, it identifies
530	   the position after the last line.  If the MIME entity has fewer than
531	   20 but at least 11 lines, it identifies lines 11 to the last line of
532	   the MIME entity.
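
   Because positions lie between lines, the range in this example maps
   directly onto a half-open slice.  The sketch below is illustrative
   only: select_lines is a hypothetical helper, and Python's
   str.splitlines() is used as a stand-in for the memo's definition of
   line endings, which it may not match exactly.

```python
def select_lines(text, first, second):
    """Return the lines between line positions `first` and `second`.

    Position n lies after the n-th line, so line=10,20 selects lines 11..20.
    Slicing clamps positions beyond the end of the text automatically.
    """
    lines = text.splitlines()  # assumption: approximates the memo's line rules
    return lines[first:second]
```

   For a text with only 15 lines, select_lines(text, 10, 20) returns
   lines 11 to 15; for a text with fewer than 11 lines, it returns an
   empty list.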

534	   http://example.com/text.txt#match=searchterm

536	   This URI reference identifies all occurrences of the regular
537	   expression 'searchterm' in the MIME entity, i.e., all occurrences
538	   of the string 'searchterm'.  If there is more than one occurrence,
539	   then this URI reference identifies a disjoint fragment, consisting
540	   of all of these occurrences.  If there is no occurrence of the
541	   search term, the URI reference does not identify a fragment.
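
   Collecting the occurrences can be sketched as follows.  Note that
   Python's re module implements a different regular expression dialect
   than POSIX BRE, so this sketch is only valid for patterns made of
   ordinary characters, such as 'searchterm', which both dialects treat
   as a literal string.  The function name match_spans is hypothetical.

```python
import re


def match_spans(pattern, text):
    """Return (start, end) character positions of all matches.

    An empty list means the URI reference identifies no fragment.
    Caveat: `re` is not a POSIX BRE engine; use only literal patterns here.
    """
    return [m.span() for m in re.finditer(pattern, text)]
```

   More than one returned span corresponds to a disjoint fragment.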

543	   http://example.com/text.txt#line=,1;match=searchterm

545	   This URI reference identifies the first line and all occurrences of
546	   the regular expression 'searchterm' in the MIME entity.  If there is
547	   an occurrence of 'searchterm' outside of the first line, then this
548	   URI reference identifies a disjoint fragment.

550	   http://example.com/text.txt#match=hello\;

552	   This URI reference identifies all occurrences of the regular
553	   expression 'hello;' in the MIME entity.  The semicolon with the
554	   leading backslash has to be interpreted as a literal semicolon inside
555	   of the BRE, treating the '\;' as an escaped ';', so that the actual
556	   regular expression is 'hello;'.  If there is more than one occurrence
557	   of this regular expression, then this URI reference identifies a
558	   disjoint fragment, consisting of all of these occurrences.

560	   ...

562	   (more complex example...)

564	6.  Security Considerations

566	   Regular expression matching code is notoriously vulnerable to buffer
567	   overflow security holes, so any implementation supporting text/plain
568	   fragment identifiers SHOULD make sure that the code being used has
569	   been tested against buffer overflow attacks.

571	7.  Change Log

573	7.1  From -02 to -03

575	   o  Replaced most occurrences of 'resource' with 'MIME entity',
576	      because the result of dereferencing a URI is not the resource
577	      itself, but some MIME entity (in our case of type text/plain)
578	      representing it.  Thanks to Sandro Hawke for pointing this out.

580	   o  Moved Section 8 to the very back of the document.

582	   o  Added Section 4 to define the processing model for fragment
583	      identifiers (moved Section 4.1 from Section 3 to Section 4).

585	   o  Added hash scheme to make fragment identifiers more robust
586	      (Section 2.2).

588	   o  Changed IPR clause from RFC 2026 to RFC 3667 (updated version of
589	      RFC 2026)

591	7.2  From -01 to -02

593	   o  Fundamental change in semantics: counts turn into positions
594	      (between characters or lines), so in order to identify a character
595	      or line, ranges must be used (which now use positions to specify
596	      the upper and lower bounds of the range).

598	   o  Made the first value of a range optional as well, so that line=,5
599	      also is legal, identifying everything from the start of the MIME
600	      entity to the 5th line.

602	   o  Changed the syntax from parenthesis-style to a more traditional
603	      style using equals-signs.

605	7.3  From -00 to -01

607	   o  Made the second count value of ranges optional, so that something
608	      like line(10,) is legal and properly defined.

610	   o  Added non-normative reference to Internet draft about non-ASCII
611	      characters in search strings.

613	   o  Added Section 1.4 about incremental deployment.

615	   o  Added more elaborate examples.

617	   o  Added text about regex buffer overflow problems in Section 6.

619	   o  Added Section 1.1.1 about line endings in text/plain resources.

621	   o  Added Section 8 to collect open issues regarding this memo (will
622	      be deleted in final RFC text).

624	8.  Open Issues

626	   This section will not be part of the final RFC text; it serves as a
627	   container to collect to-dos (Section 8.1) and open questions (Section
628	   8.2) regarding this memo.

630	8.1  To Do

632	   o  Allow negative numbers for positions, which are interpreted as
633	      counting backwards from the MIME entity's end.

635	   o  Provide more complex example(s).

637	   o  Provide short BRE syntax and description in Appendix A (by
638	      inclusion or by reference).

640	   o  Add some text about the importance of having fragment
641	      identification capabilities for out-of-line linking methods such
642	      as XLink to Section 1.3.

644	   o  Watch IRI [11] development and update to latest version.

646	8.2  Open Questions

648	   o  Should regex ranges be allowed (i.e., a fragment ranging from one
649	      regex match to another regex match)?

651	   o  Should a more sophisticated regex mechanism than BREs be used?

653	   o  Regexes by themselves may identify disjoint sub-resources.  Should
654	      there be a mechanism to say something like "the 5th appearance of
655	      the following regex"? Or are users responsible for composing
656	      regexes which do not need this kind of additional mechanism?

658	   o  Is the concatenation of scheme parts (Section 2.1.4) and its
659	      semantics of joining the individual fragments a good thing? Or a
660	      bad thing?

662	   o  Should there be more schemes? Or fewer?

664	   o  Is it necessary to mention that applications must be able to
665	      transcode characters, because the text file and the fragment
666	      identifier may use different character encodings? What about
667	      character normalization? Should that be addressed or at least
668	      mentioned as being out of scope?

670	   o  MD5 values are now specified as 32 hex digits.  An alternative
671	      would be the representation as specified by [12], which defines
672	      base64 encoding for the 128 bits of the checksum.  Should both
673	      forms be allowed (hex and base64) or is one enough? If only one,
674	      is hex the right choice?

676	9.  References

678	9.1  Normative References

680	   [1]  Bradner, S., "Key words for use in RFCs to Indicate Requirement
681	        Levels", RFC 2119, March 1997.

683	   [2]  Freed, N. and N. Borenstein, "Multipurpose Internet Mail
684	        Extensions (MIME) Part One: Format of Internet Message Bodies",
685	        RFC 2045, November 1996.

687	   [3]  Freed, N. and N. Borenstein, "Multipurpose Internet Mail
688	        Extensions (MIME) Part Two: Media Types", RFC 2046, November
689	        1996.

691	   [4]  Gellens, R., "The Text/Plain Format and DelSp Parameters", RFC
692	        3676, February 2004.

694	   [5]  Berners-Lee, T., Fielding, R. and L. Masinter, "Uniform Resource
695	        Identifiers (URI): Generic Syntax", RFC 2396, August 1998.

697	   [6]  International Organization for Standardization, "Information
698	        technology - Portable Operating System Interface (POSIX) - Part
699	        2: Shell and Utilities", ISO 9945-2, 1993.

701	   [7]  Crocker, D. and P. Overell, "Augmented BNF for Syntax
702	        Specifications: ABNF", RFC 2234, November 1997.

704	   [8]  Rivest, R., "The MD5 Message-Digest Algorithm", RFC 1321, April
705	        1992.

707	9.2  Non-Normative References

709	   [9]   Connolly, D. and L. Masinter, "The 'text/html' Media Type", RFC
710	         2854, June 2000.

712	   [10]  Hoffman, P. and F. Yergeau, "UTF-16, an encoding of ISO 10646",
713	         RFC 2781, February 2000.

715	   [11]  Duerst, M. and M. Suignard, "Internationalized Resource
716	         Identifiers (IRI)", draft-duerst-iri-11 (work in progress), Nov
717	         2004.

719	   [12]  Myers, J. and M. Rose, "The Content-MD5 Header Field", RFC
720	         1864, October 1995.

722	   [13]  Rose, M., "Writing I-Ds and RFCs using XML", RFC 2629, June
723	         1999.

725	Author's Address

727	   Erik Wilde
728	   ETH Zurich
729	   ETH-Zentrum
730	   8092 Zurich
731	   Switzerland

733	   Phone: +41-1-6325132
734	   EMail: net.dret@dret.net
735	   URI:   http://dret.net/netdret/

737	Appendix A.  POSIX BRE Syntax

739	   This section contains a short (and non-normative) summary of the
740	   POSIX BRE syntax defined in ISO 9945-2 [6].  The definition of BRE
741	   syntax in ISO 9945-2 [6] is the normative reference, and the
742	   following summary is for informative purposes only.

744	   (tbd - is there some rfc that could be referenced instead?)

746	Appendix B.  Where to send Comments

748	   Please send all comments and questions concerning this document to
749	   Erik Wilde.

751	Appendix C.  Acknowledgements

753	   This document has been prepared using the IETF document DTD described
754	   in RFC 2629 [13].

756	   Thanks for comments and suggestions provided by Dan Kohn, John Cowan,
757	   Benja Fallenstein, Henrik Levkowetz, Sandro Hawke, and Marcel
758	   Baschnagel.

760	Intellectual Property Statement

762	   The IETF takes no position regarding the validity or scope of any
763	   Intellectual Property Rights or other rights that might be claimed to
764	   pertain to the implementation or use of the technology described in
765	   this document or the extent to which any license under such rights
766	   might or might not be available; nor does it represent that it has
767	   made any independent effort to identify any such rights.  Information
768	   on the procedures with respect to rights in RFC documents can be
769	   found in BCP 78 and BCP 79.

771	   Copies of IPR disclosures made to the IETF Secretariat and any
772	   assurances of licenses to be made available, or the result of an
773	   attempt made to obtain a general license or permission for the use of
774	   such proprietary rights by implementers or users of this
775	   specification can be obtained from the IETF on-line IPR repository at
776	   http://www.ietf.org/ipr.

778	   The IETF invites any interested party to bring to its attention any
779	   copyrights, patents or patent applications, or other proprietary
780	   rights that may cover technology that may be required to implement
781	   this standard.  Please address the information to the IETF at
782	   ietf-ipr@ietf.org.

784	Disclaimer of Validity

786	   This document and the information contained herein are provided on an
787	   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
788	   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
789	   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
790	   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
791	   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
792	   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

794	Copyright Statement

796	   Copyright (C) The Internet Society (2004).  This document is subject
797	   to the rights, licenses and restrictions contained in BCP 78, and
798	   except as set forth therein, the authors retain all their rights.

800	Acknowledgment

802	   Funding for the RFC Editor function is currently provided by the
803	   Internet Society.