idnits 2.17.1 

draft-wilde-text-fragment-05.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** It looks like you're using RFC 3978 boilerplate.  You should update this
     to the boilerplate described in the IETF Trust License Policy document
     (see https://trustee.ietf.org/license-info), which is required now.

  -- Found old boilerplate from RFC 3978, Section 5.1 on line 14.

  -- Found old boilerplate from RFC 3978, Section 5.5 on line 810.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 787.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 794.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 800.

  ** This document has an original RFC 3978 Section 5.4 Copyright Line,
     instead of the newer IETF Trust Copyright according to RFC 4748.

  ** This document has an original RFC 3978 Section 5.5 Disclaimer, instead
     of the newer disclaimer which includes the IETF Trust according to RFC
     4748.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an IANA Considerations section.  (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  ** There are 2 instances of too long lines in the document, the longest one
     being 28 characters in excess of 72.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not
     match the current year

  == The document seems to lack the recommended RFC 2119 boilerplate, even if
     it appears to use RFC 2119 keywords. 

     (The document does seem to have the reference to RFC 2119 which the
     ID-Checklist requires).
  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (Jan 6, 2006) is 6686 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Possible downref: Non-RFC (?) normative reference: ref. '6'

  ** Obsolete normative reference: RFC 4234 (ref. '7') (Obsoleted by RFC 5234)

  ** Downref: Normative reference to an Informational RFC: RFC 1321 (ref. '9')

  -- Obsolete informational reference (is this intentional?): RFC 2629 (ref.
     '13') (Obsoleted by RFC 7749)


     Summary: 7 errors (**), 0 flaws (~~), 3 warnings (==), 9 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Network Working Group                                           E. Wilde
3	Internet-Draft                                                ETH Zurich
4	Expires: July 10, 2006                                       Jan 6, 2006

6	         URI Fragment Identifiers for the text/plain Media Type
7	                      draft-wilde-text-fragment-05

9	Status of this Memo

11	   By submitting this Internet-Draft, each author represents that any
12	   applicable patent or other IPR claims of which he or she is aware
13	   have been or will be disclosed, and any of which he or she becomes
14	   aware will be disclosed, in accordance with Section 6 of BCP 79.

16	   Internet-Drafts are working documents of the Internet Engineering
17	   Task Force (IETF), its areas, and its working groups.  Note that
18	   other groups may also distribute working documents as Internet-
19	   Drafts.

21	   Internet-Drafts are draft documents valid for a maximum of six months
22	   and may be updated, replaced, or obsoleted by other documents at any
23	   time.  It is inappropriate to use Internet-Drafts as reference
24	   material or to cite them other than as "work in progress."

26	   The list of current Internet-Drafts can be accessed at
27	   http://www.ietf.org/ietf/1id-abstracts.txt.

29	   The list of Internet-Draft Shadow Directories can be accessed at
30	   http://www.ietf.org/shadow.html.

32	   This Internet-Draft will expire on July 10, 2006.

34	Copyright Notice

36	   Copyright (C) The Internet Society (2006).

38	Abstract

40	   This memo defines URI fragment identifiers for text/plain MIME
41	   entities.  These fragment identifiers make it possible to refer to
42	   parts of a text MIME entity, identified by character count or range,
43	   line count or range, or a regular expression.  These identification
44	   methods can be combined to identify more than one sub-resource of a
45	   text/plain MIME entity.  Fragment identifiers may also contain hash
46	   information to make them more robust.

48	Table of Contents

50	   1.  Introduction
51	     1.1.  What is text/plain?
52	       1.1.1.  Line Endings in text/plain MIME Entities
53	     1.2.  What is a URI Fragment Identifier?
54	     1.3.  Why text/plain Fragment Identifiers?
55	     1.4.  Incremental Deployment
56	   2.  Fragment Identification Methods
57	     2.1.  Fragment Identification Schemes
58	       2.1.1.  Principles
59	       2.1.2.  Combining the Principles
60	       2.1.3.  Regular Expressions
61	       2.1.4.  Combining Fragment Identification Scheme Parts
62	     2.2.  Fragment Identifier Robustness
63	   3.  Fragment Identification Syntax
64	     3.1.  Non-ASCII Characters in Regular Expressions
65	     3.2.  Hash Sums
66	   4.  Fragment Identifier Processing
67	     4.1.  Handling of position Values
68	     4.2.  Handling of Hash Sums
69	     4.3.  Syntax Errors in Fragment Identifiers
70	   5.  Examples
71	   6.  Security Considerations
72	   7.  Change Log
73	     7.1.  From -04 to -05
74	     7.2.  From -03 to -04
75	     7.3.  From -02 to -03
76	     7.4.  From -01 to -02
77	     7.5.  From -00 to -01
78	   8.  References
79	     8.1.  Normative References
80	     8.2.  Non-Normative References
81	   Appendix A.  Where to send Comments
82	   Appendix B.  Acknowledgements
83	   Author's Address
84	   Intellectual Property and Copyright Statements

86	1.  Introduction

88	   Compliant software MUST follow this specification.  The capitalized
89	   key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
90	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
91	   document are to be interpreted as described in RFC 2119 [1].

93	1.1.  What is text/plain?

95	   Internet Media Types as defined in RFC 2045 [2] and RFC 2046 [3] are
96	   used to identify different types and sub-types of media.  RFC 2046
97	   [3] and RFC 3676 [4] specify the text/plain media type, which is used
98	   for simple, unformatted text.  Quoting from RFC 2046 [3]: "Plain text
99	   does not provide for or allow formatting commands, font attribute
100	   specifications, processing instructions, interpretation directives,
101	   or content markup.  Plain text is seen simply as a linear sequence of
102	   characters, possibly interrupted by line breaks or page breaks."

104	   The text/plain media type does not restrict the character encoding;
105	   any character encoding may be used.  In the absence of an explicit
106	   character encoding declaration, US-ASCII is assumed as the default
107	   character encoding.  This variability of the character encoding makes
108	   it impossible to count characters in a text/plain MIME entity without
109	   taking the character encoding into account, because there are many
110	   character encodings using more than one octet per character.

112	   The biggest advantage of text/plain MIME entities is their ease of
113	   use and their portability among different platforms.  As long as they
114	   use popular character encodings (such as US-ASCII), they can be
115	   displayed and processed on virtually every computer system.

117	1.1.1.  Line Endings in text/plain MIME Entities

119	   RFC 2046 [3] and RFC 3676 [4] specify that line endings in text/plain
120	   MIME entities are represented by CR+LF character sequences.  In
121	   implementation practice, however, text/plain MIME entities use
122	   different conventions, for example depending on the operating system
123	   they have been created with (in most cases, Unix uses LF, MacOS uses
124	   CR, and Windows uses CR+LF).  Because of this diversity of
125	   conventions, implementations interpreting text/plain fragment
126	   identifiers MUST take different line ending conventions into account.

128	   Line endings in text/plain MIME entities MAY be represented by other
129	   character (sequences) than CR+LF, specifically CR, LF, NEL, and CR+
130	   NEL.  All these character (sequences) MUST be interpreted as line
131	   endings.  This interpretation MUST affect the evaluation of text/
132	   plain fragment identifiers.  All representations of line endings
133	   (CR+LF, CR, LF, NEL, and CR+NEL) MUST be treated as a single
134	   character in character counts.  For the purpose of regular expression
135	   matching, all representations of line endings MUST be treated as
136	   single LF characters.  The reason for this is that fragment
137	   identifiers should not be broken by converting a file from one line
138	   ending convention to another.
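
The normalization described above can be sketched in Python as
follows.  This is a minimal illustration, not part of the
specification: the function name normalize_line_endings is
hypothetical, and the input is assumed to be text that has already
been decoded from the entity's character encoding, so that NEL
appears as U+0085.

```python
import re

# Line-ending forms listed in Section 1.1.1. The two-character forms come
# first so that CR+LF and CR+NEL are each consumed as one line ending.
_LINE_ENDING = re.compile("\r\n|\r\x85|\r|\n|\x85")


def normalize_line_endings(text: str) -> str:
    """Replace every recognized line ending with a single LF."""
    return _LINE_ENDING.sub("\n", text)
```

After this step each line ending occupies exactly one character, so
character counts and regular expression matching can both operate on
the same normalized text.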

140	   In general, the line ending conventions used in text/plain MIME
141	   entities depend on the character encoding of the MIME entity.
142	   Implementations SHOULD attempt to be as accurate as possible in
143	   recognizing line endings specific to particular character encodings,
144	   and MUST treat all these line endings as one character in character
145	   counts, and single LF characters for regular expression matching.

147	1.2.  What is a URI Fragment Identifier?

149	   URIs are the identification mechanism for resources on the Web. The
150	   URI syntax specified in RFC 3986 [5] includes as part of a URI a
151	   fragment identifier, which (quoting from RFC 3986 [5]) "consists of
152	   additional reference information to be interpreted by the user agent
153	   after the retrieval action has been successfully completed.  As such,
154	   it is not part of a URI, but is often used in conjunction with a URI.
155	   The semantics of a fragment identifier is a property of the data
156	   resulting from a retrieval action, regardless of the type of URI used
157	   in the reference.  Therefore, the format and interpretation of
158	   fragment identifiers is dependent on the media type of the retrieval
159	   result."

161	   The most popular fragment identifier is defined for text/html
162	   (defined in RFC 2854 [10]), and makes it possible to refer to a
163	   specific element (identified by a 'name' or 'id' attribute) of an
164	   HTML document.

166	1.3.  Why text/plain Fragment Identifiers?

168	   Referring to specific parts of a resource can be very useful, because
169	   it enables users and applications to create more specific references.
170	   Rather than pointing to a whole resource, users can create references
171	   to the part they really are interested in or want to talk about.
172	   Even though it is suggested that fragment identification methods are
173	   specified in a media type's MIME registration, many media types do
174	   not have fragment identification methods associated with them.

176	   Fragment identifiers are only useful if supported by the client,
177	   because they are only interpreted by the client.  Therefore, a new
178	   fragment identification method will require some time to be adopted
179	   by clients, and older clients will not support it.  However, because
180	   the URI still works even if the fragment identifier is not supported
181	   (the resource is retrieved, but the fragment identifier is not
182	   interpreted), rapid adoption is not highly critical to ensure the
183	   success of a new fragment identification method.

185	   Fragment identifiers for text/plain make it possible to refer to
186	   specific parts of a text MIME entity, using concepts of positions and
187	   ranges, which may be applied to characters and lines.  They also
188	   support locating a fragment by using a regular expression for
189	   searching for a specific character sequence.  Thus, text/plain
190	   fragment identifiers enable users to exchange information more
191	   specifically, thereby reducing time and effort that is necessary to
192	   manually search for the relevant part of a text/plain MIME entity.

194	   The text/plain format does not support the embedding of links, so in
195	   normal environments, text/plain resources can only serve as targets
196	   for links, and not as sources.  However, when combining the text/
197	   plain fragment identifiers specified in this memo with out-of-line
198	   linking mechanisms such as XLink [11], it is possible to "embed" link
199	   sources into text/plain resources.  Thus, the text/plain fragment
200	   identifiers specified in this memo open a path for text/plain files
201	   to become fully integrated resources in hypermedia systems such as
202	   the Web.

204	1.4.  Incremental Deployment

206	   As long as support for text/plain fragment identifiers is not
207	   implemented by all programs, it is important to consider the
208	   implications of incremental deployment.  Clients (for example, Web
209	   browsers) not supporting the text/plain fragment identifier described
210	   in this memo will work with URI references to text/plain MIME
211	   entities, but they will fail to locate the sub-resource identified by
212	   the fragment identifier.  This is a reasonable fallback behavior, and
213	   in general users should take into account the possibility that a
214	   program interpreting a given URI will fail to interpret the fragment
215	   identifier part.  Since fragment identifier evaluation is local to
216	   the client (and happens after retrieving the MIME entity), there is
217	   no way for a server to determine whether a requesting client is using
218	   a URI containing a fragment identifier.

220	2.  Fragment Identification Methods

222	   The identification of fragments of text/plain MIME entities can be
223	   based on different foundations.  Since it is not possible to insert
224	   explicit, invisible identifiers into a text/plain MIME entity (as for
225	   example used in HTML documents, implemented through special
226	   attributes), fragment identification has to rely on certain inherent
227	   criteria of the MIME entity.  This memo specifies fragment
228	   identification using six different methods, which are character
229	   positions and ranges, line positions and ranges, regular expression
230	   matching, and a mechanism for improving the robustness of fragment
231	   identifiers (entity hashes).

233	   When interpreting character or line numbers, implementations MUST
234	   take the character encoding of the MIME entity into account, because
235	   character count and octet count may differ for the character encoding
236	   being used.  For example, a MIME entity using UTF-16 encoding (as
237	   specified in RFC 2781 [12]) uses two or four octets per character,
238	   and it may have a leading BOM (Byte-Order Mark), which does not count
239	   as a character and thus also affects the mapping from a simple octet
240	   count to a character count.
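
The difference between octet count and character count can be
illustrated with a short Python sketch.  The function name
count_characters and the sample entity are hypothetical, and line
ending handling (Section 1.1.1) is deliberately left out to keep the
example small.

```python
import codecs


def count_characters(octets: bytes, charset: str) -> int:
    """Count characters (code points) in an entity, ignoring a leading BOM."""
    text = octets.decode(charset)
    # Codecs such as "utf-16" consume the BOM themselves; others, such as
    # "utf-16-le", leave it in place as U+FEFF, so it is dropped here.
    if text.startswith("\ufeff"):
        text = text[1:]
    return len(text)


# Hypothetical sample entity: a little-endian BOM followed by six characters.
entity = codecs.BOM_UTF16_LE + "Zürich".encode("utf-16-le")
```

The sample entity occupies more octets than it has characters, and
the same six characters encoded as UTF-8 occupy a different number of
octets again, which is why the character encoding has to be known
before counting.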

242	2.1.  Fragment Identification Schemes

244	   Fragment identification can be done using regular expressions or
245	   combining two orthogonal principles, which are positions and ranges,
246	   and characters and lines.  The following sections describe the
247	   principles themselves, while Section 2.1.2 describes the combination
248	   of the principles.

250	2.1.1.  Principles

252	2.1.1.1.  Positions and Ranges

254	   A position does not identify an actual fragment of the MIME entity,
255	   but a position inside the MIME entity, which could be regarded as a
256	   fragment of zero length.  The use case for positions is to provide
257	   pointers for applications which may use them to implement
258	   functionalities such as "insert some text here", which needs a
259	   position rather than a fragment.  Positions are counted from zero
260	   (position zero being before the first character or line of a text/
261	   plain MIME entity), so that a text/plain MIME entity having one
262	   character has two positions, one before the first character (position
263	   0), and one after the first character (position 1).

265	   Since positions are fragments of length zero, applications SHOULD use
266	   other methods than highlighting to indicate positions, the most
267	   obvious way being the positioning of a cursor (if the application
268	   supports the concept of a cursor).

270	   Ranges, on the other hand, identify fragments of a MIME entity that
271	   have a length that may be greater than zero.  As a general principle
272	   for ranges, they specify both a lower and an upper bound.  The start
273	   or the end of a range specification may be omitted, defaulting to the
274	   first or last position of the MIME entity, respectively.  The ending
275	   position of a range must have a value greater than or equal to the
276	   lower position (consequently, a range with identical lower and upper
277	   positions is legal, and identifies a range of length 0, which is
278	   equivalent to a position).  Counting for ranges uses positions, so
279	   that a fragment containing one entity is specified by using a range
280	   with two adjacent positions.

282	   Since ranges are fragments with a length greater than zero,
283	   applications SHOULD use methods like highlighting to indicate ranges
284	   (if the application supports the concept of highlighting).

286	   For positions and ranges it is implicitly assumed that if a number is
287	   greater than the actual number of elements in the MIME entity, then
288	   it is referring to the last element of the MIME entity (see Section 4
289	   for the processing model).

291	2.1.1.2.  Characters and Lines

293	   The concept of positions and ranges may be applied to characters and
294	   lines.  In both cases, positions indicate points between entities,
295	   while ranges identify zero or more entities by indicating positions.

297	   Character positions are numbered starting with zero (ignoring initial
298	   BOM marks or similar concepts that are not part of the actual textual
299	   content of a text/plain MIME entity), and counting each character
300	   separately, with the exception of line endings, which are always
301	   counted as one character (Section 1.1.1 describes how line endings
302	   MUST be identified).

304	   Line positions are numbered starting with zero (with line position
305	   zero always being identical with character position zero), with
306	   Section 1.1.1 describing how line endings MUST be identified.
307	   Fragments identified by lines include the line endings, so
308	   applications identifying line-based fragments MUST include the line
309	   endings in the fragment identification they are using (eg, the
310	   highlighted selection).  If a MIME entity does not contain any line
311	   endings, then it consists of a single (the first) line.
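
One possible way to map line positions onto character positions is
sketched below in Python.  The function names are hypothetical, and
the text is assumed to have its line endings already normalized to
LF.  The sketch also assumes that a trailing line ending does not
start a further (empty) line and that an empty entity has only line
position 0; this memo does not address either case explicitly.

```python
from typing import List


def line_position_offsets(text: str) -> List[int]:
    """Return the character offset of every line position in `text`."""
    offsets = [0]  # line position 0 coincides with character position 0
    for index, char in enumerate(text):
        if char == "\n":
            offsets.append(index + 1)  # position just after the line ending
    if offsets[-1] != len(text):
        offsets.append(len(text))  # last line has no trailing line ending
    return offsets


def line_range(text: str, start: int, end: int) -> str:
    """Return the lines between two line positions, line endings included."""
    offsets = line_position_offsets(text)
    last = len(offsets) - 1  # values beyond the entity refer to its end
    return text[offsets[min(start, last)]:offsets[min(end, last)]]
```

With three lines there are four line positions, and a range between
two adjacent line positions selects exactly one line together with
its line ending.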

313	2.1.2.  Combining the Principles

315	   In the following sections, the principles described in the preceding
316	   section (positions/ranges and characters/lines) are combined,
317	   resulting in four use cases.

319	2.1.2.1.  Character Position

321	   Using the char scheme followed by a single number, it is possible to
322	   point to a character position (ie, a fragment of length zero between
323	   two characters).  Rather than identifying a fragment consisting of a
324	   number of characters, this method identifies a position between two
325	   characters (or before the first or after the last character).
326	   Character position counting starts with 0, so the character position
327	   before the first character of a text/plain MIME entity has the
328	   character position 0, and a MIME entity containing n distinct
329	   characters has n+1 distinct character positions, the last one having
330	   the character position n.

332	2.1.2.2.  Character Range

334	   If it is necessary to identify a fragment of one or more characters
335	   using character counting, this can be done by using a character
336	   range, using the char scheme followed by a range specification.  A
337	   character range is a consecutive region of the MIME entity that
338	   extends from the starting character position of the range to the
339	   ending character position of the range.

341	2.1.2.3.  Line Position

343	   Using the line scheme followed by a single number, it is possible to
344	   point to a line position (ie, a fragment of length zero between two
345	   lines).  Rather than identifying a fragment consisting of a number of
346	   lines, this method identifies a position between two lines (or before
347	   the first or after the last line).  Line position counting starts
348	   with 0, so the line position before the first line of a text/plain
349	   MIME entity has the line position 0, and a MIME entity containing n
350	   distinct lines has n+1 distinct line positions, the last one having
351	   the line position n.

353	2.1.2.4.  Line Range

355	   If it is necessary to identify a fragment of one or more lines using
356	   line counting, this can be done by using a line range, using the line
357	   scheme followed by a range specification.  A line range is a
358	   consecutive region of the MIME entity that extends from the starting
359	   line position of the range to the ending line position of the range.

361	2.1.3.  Regular Expressions

363	   A common problem with fragment identifiers is their robustness (to
364	   changes in the MIME entity), and character and line counts can break
365	   very easily.  A more robust way of identifying a fragment is by
366	   searching for a specific pattern (another way of making fragment
367	   identifiers more robust is described in Section 2.2 about including
368	   entity hash sums in the fragment identifier).  Thus, it is possible
369	   to use a Basic Regular Expression (BRE) as defined by ISO 9945-2 [6]
370	   (the POSIX standard) as a fragment identifier.
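
The idea of locating fragments by pattern can be sketched in Python.
Note that Python's re module does not implement POSIX Basic Regular
Expressions, so it is used here only as a stand-in; the sketch is
meaningful only for patterns in the subset where both syntaxes agree
(literal characters, ".", "*", and bracket expressions).  The
function name match_fragments is hypothetical.

```python
import re
from typing import List, Tuple


def match_fragments(text: str, pattern: str) -> List[Tuple[int, int]]:
    """Return (start, end) character positions of every match of `pattern`.

    Assumption: `pattern` stays within the subset shared by POSIX BRE
    and Python's `re` syntax; a real implementation needs a BRE engine.
    """
    return [match.span() for match in re.finditer(pattern, text)]
```

Each returned pair is a range expressed in character positions, so a
pattern that occurs several times yields several disjoint fragments.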

372	2.1.4.  Combining Fragment Identification Scheme Parts

374	   While in most cases only one fragment identification scheme part will
375	   be used, it is possible to combine them.  By simply concatenating
376	   different fragment identification scheme parts, separated by a
377	   semicolon, the whole fragment identifier refers to the union of all
378	   fragments of the text/plain MIME entity identified by the individual
379	   fragment identification scheme parts.  This way, it is possible to
380	   identify disjoint ranges, such as multiple line ranges.
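
The union of fragments can be computed by merging ranges, as in the
following Python sketch.  The function name union_of_fragments is
hypothetical, and each fragment is assumed to be a (start, end) pair
of character positions.  Treating ranges that merely touch as one
fragment is a choice made for this example, not something this memo
requires.

```python
from typing import Iterable, List, Tuple


def union_of_fragments(fragments: Iterable[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Merge overlapping or touching ranges into a sorted list of ranges."""
    merged: List[Tuple[int, int]] = []
    for start, end in sorted(fragments):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous range: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

The result is a list of disjoint ranges in document order, which is a
convenient form for an application that highlights each fragment.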

382	   It should be noted that regular expressions by themselves may
383	   identify disjoint fragments, which is true in any case where the
384	   regular expression matches more than one occurrence in the MIME
385	   entity.

387	   Since disjoint fragments can be identified, implementations SHOULD
388	   make sure that these fragments are appropriately marked, for example
389	   by highlighting the fragment (rather than only scrolling to some
390	   line, which only identifies a single position in the MIME entity).
391	   If an implementation can not mark disjoint fragments, it MAY resort
392	   to marking only the first of the disjoint fragments.  However, the
393	   exact method of how implementations deal with disjoint fragments
394	   depends on the application and interface, and is beyond the scope of
395	   this memo.

397	2.2.  Fragment Identifier Robustness

399	   While regular expressions (as described in Section 2.1.3) may make
400	   fragment identifiers more robust than character or line counts, it is
401	   still possible that modifications of the resource will break the
402	   fragment identifier.  If applications want to create more robust
403	   fragment identifiers, they may do so by adding hash sums to fragment
404	   identifiers.  These hash sums are used to detect a change in the
405	   resource, so that applications may warn users about the possibility
406	   that a fragment identifier might have been broken by a modification
407	   of the resource.  Since fragment identifiers are interpreted by
408	   clients, hash sums are defined on MIME entities rather than the
409	   resource itself, and as such are specific to a certain representation
410	   of the resource; in the case of text/plain resources, to the
411	   character encoding of the MIME entity.

413	   Hash sums may specify the character encoding that has been used when
414	   creating the hash sums, and if such a specification is present,
415	   clients MUST check whether the character encoding specified for the
416	   hash sum and the character encoding of the retrieved MIME entity are
417	   equal, and clients MUST NOT check the hash sum if these values
418	   differ.  However, clients MAY choose to transcode the retrieved MIME
419	   entity in the case of differing character encodings, and after doing
420	   so, they MAY check the hash sum (please note that this method is
421	   inherently unreliable, though, because certain characters or
422	   character sequences may have been lost or normalized due to
423	   restrictions of the coded character set).

425	3.  Fragment Identification Syntax

427	   The syntax for the fragment identifiers is straightforward.  The
428	   syntax defines four schemes, 'char', 'line', 'match', and hash (which
429	   can either be 'length' or 'md5').  The 'char' and 'line' can be used
430	   in two different variants, either the position variant (with a single
431	   number), or the range variant (with two comma-separated positions).
432	   The 'match' scheme has a regular expression as parameter, which must
433	   be specified as a string with escaped semicolons (because the
434	   semicolon is used to concatenate multiple fragment identification
435	   scheme parts).  The hash scheme can either use the 'length' or the
436	   'md5' scheme to specify a hash value.

438	   The following syntax definition uses ABNF as defined in RFC 4234 [7].

440	   text-fragment =  text-scheme 0*( ";" text-scheme) 0*( ";" hash-scheme)
441	   text-scheme   =  ( char-scheme / line-scheme / match-scheme )
442	   hash-scheme   =  ( length-scheme / md5-scheme ) [ "," charenc ]
443	   char-scheme   =  "char=" ( position / range )
444	   line-scheme   =  "line=" ( position / range )
445	   match-scheme  =  "match=" regex
446	   position      =  number
447	   range         =  (position "," [ position ]) / ("," position )
448	   number        =  1*( DIGIT )
449	   regex         =  StringWithEscapedSemicolon
450	   length-scheme =  "length=" number
451	   md5-scheme    =  "md5=" md5-value
452	   md5-value     =  32( hexdigit )
453	   hexdigit      =  (DIGIT / "a" / "A" / "b" / "B" / "c" / "C" / "d" / "D" / "e" / "E" / "f" / "F" )
454	   charenc       =  StringWithEscapedSemicolon

456	   The StringWithEscapedSemicolon is a string where all characters may
457	   appear literally (except the characters which are required by the URI
458	   syntax to be escaped), with the exception of a semicolon.  A
459	   semicolon that should be part of the regular expression must be
460	   escaped with a leading backslash, and implementations MUST make sure
461	   to properly interpret regular expressions, properly dereferencing all
462	   escape mechanisms that apply (ie, URI encoding, semicolon escaping,
463	   and BRE escaping, as well as any additional escaping that may be
464	   present due to the context of the URI).
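
Splitting a fragment identifier at unescaped semicolons can be
sketched in Python as follows.  The function name split_scheme_parts
is hypothetical.  The sketch assumes that URI percent-decoding has
already been applied, and that a backslash followed by any character
other than a semicolon belongs to the regular expression and is
passed through unchanged; this memo does not spell out the latter
case.

```python
from typing import List


def split_scheme_parts(fragment: str) -> List[str]:
    """Split at unescaped semicolons and turn each "\\;" into ";"."""
    parts: List[str] = []
    current: List[str] = []
    i = 0
    while i < len(fragment):
        char = fragment[i]
        if char == "\\" and i + 1 < len(fragment):
            following = fragment[i + 1]
            # "\;" is a literal semicolon; other pairs are kept intact
            # so that the regular expression layer can interpret them.
            current.append(";" if following == ";" else char + following)
            i += 2
        elif char == ";":
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(char)
            i += 1
    parts.append("".join(current))
    return parts
```

Scanning character by character, rather than splitting on every
semicolon, keeps an escaped backslash followed by a separator from
being misread as an escaped semicolon.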

466	3.1.  Non-ASCII Characters in Regular Expressions

468	   RFC 3986 [5] only allows a subset of ASCII as characters in URIs.
469	   Consequently, it is not possible to use non-ASCII characters in URIs.
470	   However, using Internationalized Resource Identifiers (IRI) as
471	   defined by RFC 3987 [8], it is possible to use non-ASCII characters,
472	   using the encoding defined by IRI.  Thus, using IRIs it is possible
473	   to use non-ASCII characters in regular expressions, and
474	   implementations MUST make sure to correctly handle any non-ASCII
475	   characters in regular expressions, if they accept IRI-encoded text/
476	   plain fragment identifiers.

478	3.2.  Hash Sums

480	   A hash sum can either specify a MIME entity's length, or its MD5
481	   fingerprint.  In both cases, it can optionally specify the character
482	   encoding which had been used when calculating the hash sum, so that
483	   clients interpreting the fragment identifier may check whether they
484	   are using the same character encoding for their calculations.  For
485	   lengths, the character encoding is necessary because it may influence
486	   the character count (for example, a combining a-umlaut character
487	   which counts as two characters in Unicode will be collapsed to a
488	   single a-umlaut character in ISO 8859 encoding).  Using Unicode
489	   terminology, this means that the length of a text/plain MIME entity
490	   is computed based on its "code points" (other possibilities would
491	   have included "code units", which depend on the encoding, and
492	   "graphemes", which require knowledge about code point semantics).
493	   For MD5 fingerprints, the character encoding is necessary because the
494	   MD5 algorithm works on the binary representation of the text/plain
495	   resource.

497	   The length of a text/plain MIME entity is calculated by using the
498	   principles defined in Section 2.1.1.2.  The MD5 fingerprint of a
499	   text/plain MIME entity is calculated by using the algorithm presented
500	   in [9], encoding the result in 32 hexadecimal digits (using uppercase
501	   or lowercase letters) as a representation of the 128 bits that are
502	   the result of the MD5 algorithm.
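
   The two hash sums can be sketched in Python as follows.  This is a
   non-normative illustration: the function names are assumptions, and
   the length is taken to be the number of code points, as described
   above.  A 128-bit MD5 result is written as 32 hexadecimal digits.

```python
import hashlib
import unicodedata


def length_in_code_points(text: str) -> int:
    # A Python str is a sequence of Unicode code points, so len()
    # counts code points (not code units or graphemes).
    return len(text)


def md5_fingerprint(text: str, charset: str) -> str:
    # MD5 works on octets, so the text is first encoded in the
    # character encoding named in the hash sum.
    return hashlib.md5(text.encode(charset)).hexdigest()


decomposed = "a\u0308"  # 'a' followed by COMBINING DIAERESIS
composed = unicodedata.normalize("NFC", decomposed)  # single U+00E4

print(length_in_code_points(decomposed))      # 2
print(length_in_code_points(composed))        # 1
print(len(md5_fingerprint("hello", "utf-8")))  # 32
```

   The same text yields different fingerprints under different
   character encodings, which is why the encoding can be stated
   alongside the hash sum.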

504	4.  Fragment Identifier Processing

506	4.1.  Handling of position Values

508	   If any position value (as a position or inside a range) is greater
509	   than the value for the actual MIME entity, then it identifies the
510	   last character or line position of the MIME entity.  If the first
511	   position value in a range is not present, then the range extends from
512	   the start of the MIME entity.  If the second position value in a
513	   range is not present, then the range extends to the end of the MIME
514	   entity.  If a range scheme's positions are not properly ordered
515	   (i.e., the first number is greater than the second), then this
516	   scheme part MUST be ignored.
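
   A minimal Python sketch of these rules follows.  It is
   non-normative; the function name resolve_range is hypothetical, and
   the sketch assumes that "not properly ordered" means that the first
   position is greater than the second.

```python
from typing import Optional, Tuple


def resolve_range(first: Optional[int], second: Optional[int],
                  total: int) -> Optional[Tuple[int, int]]:
    """Return (start, end) positions, or None if the part is ignored.

    total is the number of characters (or lines) in the MIME entity.
    """
    # Assumption: a range is misordered when first > second.
    if first is not None and second is not None and first > second:
        return None
    # A missing first value means "from the start of the entity".
    start = 0 if first is None else min(first, total)
    # A missing second value means "to the end of the entity".
    end = total if second is None else min(second, total)
    return (start, end)


print(resolve_range(10, 20, 15))    # (10, 15)
print(resolve_range(None, 1, 5))    # (0, 1)
print(resolve_range(20, 10, 50))    # None
```

   Values larger than the entity are clamped to its last position, so a
   range that lies entirely beyond the end collapses to a single
   position.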

518	4.2.  Handling of Hash Sums

520	   Clients are not required to implement the handling of hash sums, so
521	   they MAY choose to ignore hash sum information altogether.  However,
522	   if they do implement hash sum handling, they MUST implement it as
523	   follows:

525	   If a fragment identifier contains a hash sum, and a client retrieves
526	   a MIME entity and detects that the hash sum has changed (observing
527	   the character encoding specification as described in Section 3.2, if
528	   present), then the client SHOULD NOT interpret any other text/plain
529	   fragment identifier scheme part.  A client MAY signal this situation
530	   to the user.
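
   The check can be sketched in Python as shown below.  This is a
   non-normative illustration under stated assumptions: the function
   names are hypothetical, the length is counted in code points of the
   decoded text, and transcoding is only performed for MD5
   fingerprints.

```python
import hashlib
from typing import List, Optional


def hash_sum_matches(entity: bytes, entity_charset: str, kind: str,
                     expected: str,
                     hash_charset: Optional[str] = None) -> bool:
    """Return True if the hash sum still matches the retrieved entity."""
    text = entity.decode(entity_charset)
    if kind == "length":
        return len(text) == int(expected)
    if kind == "md5":
        # Transcode locally if the hash sum names a character encoding.
        octets = text.encode(hash_charset) if hash_charset else entity
        # Hexadecimal digits may be uppercase or lowercase.
        return hashlib.md5(octets).hexdigest() == expected.lower()
    raise ValueError("unknown hash sum kind: " + kind)


def parts_to_interpret(other_parts: List[str], hash_ok: bool) -> List[str]:
    # On a mismatch, the remaining scheme parts are not interpreted.
    return other_parts if hash_ok else []


entity = "h\u00e9llo\n".encode("utf-8")  # 6 code points, 7 octets
ok = hash_sum_matches(entity, "utf-8", "length", "6")
print(parts_to_interpret(["line=10,20"], ok))  # ['line=10,20']
```

   A client that chooses to signal a mismatch to the user could do so
   at the point where the empty list is returned.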

532	4.3.  Syntax Errors in Fragment Identifiers

534	   If a fragment identifier contains a syntax error (i.e., does not
535	   conform to the syntax specified in Section 3), then it MUST be
536	   ignored by clients.  Clients SHOULD NOT make any attempt to correct
537	   or guess fragment identifiers.  Syntax errors MAY be reported by
538	   clients.
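
   The "ignore, do not repair" rule can be illustrated with the
   following Python sketch.  It is non-normative, and the pattern is a
   hypothetical simplification that covers only char= and line= scheme
   parts as they appear in the examples of this memo; it is not the
   grammar of Section 3.

```python
import re
from typing import List, Optional

# Hypothetical, simplified pattern: a position, a range with an
# optional second value, or a range with only a second value.
_PART = r"(?:char|line)=(?:\d+(?:,\d*)?|,\d+)"
_FRAGMENT = re.compile(rf"{_PART}(?:;{_PART})*")


def parse_or_ignore(fragment: str) -> Optional[List[str]]:
    """Return the scheme parts, or None if the fragment is ignored."""
    if _FRAGMENT.fullmatch(fragment) is None:
        # Syntax error: ignore the identifier, make no attempt to fix it.
        return None
    return fragment.split(";")


print(parse_or_ignore("line=,1;char=100"))  # ['line=,1', 'char=100']
print(parse_or_ignore("line=10-20"))        # None
```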

540	5.  Examples

542	   The following examples show some usages for the fragment identifiers
543	   defined in this memo.

545	   http://example.com/text.txt#char=100

547	   This URI identifies the position after the 100th character of the
548	   text.txt MIME entity.  Which octet(s) of the MIME entity this
549	   corresponds to is not clear without retrieving the MIME entity and
550	   thus knowing which character encoding it is using (in the case of
551	   HTTP, this information is given in the response's Content-Type
552	   header).  If the MIME entity has fewer than 100 characters, the
553	   URI identifies the position after the MIME entity's last character.
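
   The dependence on the character encoding can be illustrated with a
   short, non-normative Python sketch.  The function name octet_offset
   and the sample text are assumptions made for this example, and the
   approach of encoding a prefix is only suitable for stateless
   encodings that do not emit a byte order mark.

```python
def octet_offset(text: str, char_position: int, charset: str) -> int:
    """Return the octet offset of a character position in the entity."""
    # Positions beyond the end identify the position after the last
    # character of the entity.
    position = min(char_position, len(text))
    return len(text[:position].encode(charset))


sample = "\u00e4" * 150  # 150 a-umlaut characters

print(octet_offset(sample, 100, "utf-8"))       # 200
print(octet_offset(sample, 100, "iso-8859-1"))  # 100
print(octet_offset("abc", 100, "utf-8"))        # 3
```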

555	   http://example.com/text.txt#line=10,20

557	   This URI identifies lines 11 to 20 of the text.txt MIME entity.  If
558	   the MIME entity has fewer than 11 lines, it identifies the position
559	   after the last line.  If the MIME entity has fewer than 20 but at
560	   least 11 lines, it identifies lines 11 to the last line of the MIME
561	   entity.
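
   Because the values are positions between lines, the range maps
   directly onto a half-open slice.  The following non-normative Python
   sketch shows this; the function name select_lines is hypothetical,
   and str.splitlines() is used as a stand-in for the line ending rules
   of this memo, which it may not match exactly.

```python
from typing import List


def select_lines(text: str, first: int, second: int) -> List[str]:
    """Return the lines between line positions first and second."""
    lines = text.splitlines(keepends=True)
    # Positions beyond the end are clamped to the last line position.
    return lines[min(first, len(lines)):min(second, len(lines))]


fifteen = "".join(f"line {n}\n" for n in range(1, 16))
selected = select_lines(fifteen, 10, 20)

print(len(selected))  # 5
print(selected[0])    # line 11
```

   For an entity with fewer than 11 lines the result is empty, which
   corresponds to the position after the last line.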

563	   http://example.com/text.txt#match=searchterm

565	   This URI identifies all occurrences of the regular expression
566	   'searchterm' in the MIME entity, i.e., all occurrences of the string
567	   'searchterm'.  If there is more than one occurrence, then this URI
568	   identifies a disjoint fragment, consisting of all of these
569	   occurrences.  If there is no occurrence of the search term, the URI
570	   does not identify a fragment.
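
   A non-normative Python sketch of collecting such a disjoint fragment
   follows.  The function name find_matches is hypothetical.  Python's
   re module is used only for illustration; its syntax differs from the
   regular expressions used by this memo, although both agree for a
   plain literal such as 'searchterm'.

```python
import re
from typing import List, Tuple


def find_matches(text: str, pattern: str) -> List[Tuple[int, int]]:
    """Return (start, end) positions of all non-overlapping matches."""
    return [m.span() for m in re.finditer(pattern, text)]


sample = "a searchterm, another searchterm"

print(find_matches(sample, "searchterm"))  # [(2, 12), (22, 32)]
print(find_matches(sample, "absent"))      # []
```

   An empty list corresponds to the case in which the URI does not
   identify a fragment.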

572	   http://example.com/text.txt#line=,1;match=searchterm

574	   This URI identifies the first line and all occurrences of the regular
575	   expression 'searchterm' in the MIME entity.  If there is an
576	   occurrence of 'searchterm' outside of the first line, then this URI
577	   identifies a disjoint fragment.

579	   http://example.com/text.txt#match=hello\;

581	   This URI identifies all occurrences of the regular expression
582	   'hello;' in the MIME entity.  The semicolon with the leading
583	   backslash has to be interpreted as a literal semicolon inside of the
584	   BRE, treating the '\;' as an escaped ';', so that the actual regular
585	   expression is 'hello;'.  If there is more than one occurrence of this
586	   regular expression, then this URI identifies a disjoint fragment,
587	   consisting of all of these occurrences.
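
   One way to separate scheme parts while respecting the escaped
   semicolon is sketched below in Python.  This is non-normative: the
   function names are hypothetical, and the sketch deliberately does
   not handle a backslash that is itself escaped immediately before a
   separator.

```python
import re
from typing import List


def split_scheme_parts(fragment: str) -> List[str]:
    """Split on ';' unless it is directly preceded by a backslash."""
    return re.split(r"(?<!\\);", fragment)


def unescape_semicolons(regex: str) -> str:
    """Turn each escaped semicolon into a literal semicolon."""
    return regex.replace("\\;", ";")


parts = split_scheme_parts(r"line=,1;match=hello\;")

print(parts)                            # ['line=,1', 'match=hello\\;']
print(unescape_semicolons(r"hello\;"))  # hello;
```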

589	   http://example.com/text.txt#line=10,20;length=9876,UTF-8

591	   As in the second example, this URI identifies lines 11 to 20 of the
592	   text.txt MIME entity.  The additional length hash sum specifies that
593	   the MIME entity has a length of 9876 code points when encoded in
594	   UTF-8.  If the client supports the length hash sum scheme, it may
595	   test the retrieved MIME entity for its length, but only if the
596	   retrieved MIME entity uses the UTF-8 encoding or has been locally
597	   transcoded into this encoding.  If the length of the retrieved MIME
598	   entity does not match the specified length in the fragment
599	   identifier, the client SHOULD NOT interpret the line part and MAY
600	   signal this to the user.

602	6.  Security Considerations

604	   Regular expression matching code is notoriously vulnerable to buffer
605	   overflow security holes, so any implementation supporting text/plain
606	   fragment identifiers SHOULD make sure that the code being used has
607	   been tested against buffer overflow attacks.

609	7.  Change Log

611	   This section will not be part of the final RFC text; it serves as a
612	   container for collecting the history of the individual draft
613	   versions.

615	7.1.  From -04 to -05

617	   o  Added some explanatory text to the last paragraph of Section 2.2.

619	   o  Added a paragraph about the importance of having fragment
620	      identification capabilities for out-of-line linking methods such
621	      as XLink to Section 1.3.

623	   o  Added explanation of why the charset is important for length hash
624	      sums to Section 3.2.

626	   o  Added text that makes hash sum handling optional and allows
627	      clients to interpret fragment identifiers even if the hash sum did
628	      not match (changed MUST NOT to SHOULD NOT) to Section 4.2.

630	   o  Added example using a length hash sum in Section 5.

632	   o  RFC 2234 (ABNF) has been obsoleted by [7].

634	   o  Removed the "Open Issues" section for preparation of final draft
635	      before submission as RFC.

637	7.2.  From -03 to -04

639	   o  URIs are now defined by RFC 3986 [5], so the text and the
640	      references have been updated.  In particular, RFC 3986 defines a
641	      fragment identifier to be part of the URI, whereas in the
642	      obsoleted RFC 2396 URI specification, it was not part of a URI as
643	      such, but of a "URI reference".

645	   o  IRIs are now defined by RFC 3987 [8], so the text and the
646	      references have been updated.

648	   o  Changed IPR clause from RFC 3667 to RFC 3978 (updated version of
649	      RFC 3667).

651	7.3.  From -02 to -03

653	   o  Replaced most occurrences of 'resource' with 'MIME entity',
654	      because the result of dereferencing a URI is not the resource
655	      itself, but some MIME entity (in our case of type text/plain)
656	      representing it.  Thanks to Sandro Hawke for pointing this out.

658	   o  Moved "Open Issues" to the very back of the document.

660	   o  Added Section 4 to define the processing model for fragment
661	      identifiers (moved Section 4.1 from Section 3 to Section 4).

663	   o  Added hash scheme to make fragment identifiers more robust
664	      (Section 2.2).

666	   o  Changed IPR clause from RFC 2026 to RFC 3667 (updated version of
667	      RFC 2026).

669	7.4.  From -01 to -02

671	   o  Fundamental change in semantics: counts turn into positions
672	      (between characters or lines), so in order to identify a character
673	      or line, ranges must be used (which now use positions to specify
674	      the upper and lower bounds of the range).

676	   o  Made the first value of a range optional as well, so that line=,5
677	      also is legal, identifying everything from the start of the MIME
678	      entity to the 5th line.

680	   o  Changed the syntax from parenthesis-style to a more traditional
681	      style using equals-signs.

683	7.5.  From -00 to -01

685	   o  Made the second count value of ranges optional, so that something
686	      like line(10,) is legal and properly defined.

688	   o  Added non-normative reference to Internet draft about non-ASCII
689	      characters in search strings.

691	   o  Added Section 1.4 about incremental deployment.

693	   o  Added more elaborate examples.

695	   o  Added text about regex buffer overflow problems in Section 6.

697	   o  Added Section 1.1.1 about line endings in text/plain resources.

699	   o  Added "Open Issues" to collect open issues regarding this memo
700	      (will be deleted in final RFC text).

702	8.  References

704	8.1.  Normative References

706	   [1]  Bradner, S., "Key words for use in RFCs to Indicate Requirement
707	        Levels", RFC 2119, March 1997.

709	   [2]  Freed, N. and N. Borenstein, "Multipurpose Internet Mail
710	        Extensions (MIME) Part One: Format of Internet Message Bodies",
711	        RFC 2045, November 1996.

713	   [3]  Freed, N. and N. Borenstein, "Multipurpose Internet Mail
714	        Extensions (MIME) Part Two: Media Types", RFC 2046,
715	        November 1996.

717	   [4]  Gellens, R., "The Text/Plain Format and DelSp Parameters",
718	        RFC 3676, February 2004.

720	   [5]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
721	        Resource Identifier (URI): Generic Syntax", RFC 3986,
722	        January 2005.

724	   [6]  International Organization for Standardization, "Information
725	        technology - Portable Operating System Interface (POSIX) - Part
726	        2: Shell and Utilities", ISO 9945-2, 1993.

728	   [7]  Crocker, D. and P. Overell, "Augmented BNF for Syntax
729	        Specifications: ABNF", RFC 4234, October 2005.

731	   [8]  Duerst, M. and M. Suignard, "Internationalized Resource
732	        Identifiers (IRI)", RFC 3987, January 2005.

734	   [9]  Rivest, R., "The MD5 Message-Digest Algorithm", RFC 1321,
735	        April 1992.

737	8.2.  Non-Normative References

739	   [10]  Connolly, D. and L. Masinter, "The 'text/html' Media Type",
740	         RFC 2854, June 2000.

742	   [11]  DeRose, S., Maler, E., and D. Orchard, "XML Linking Language
743	         (XLink) Version 1.0", W3C Recommendation REC-xlink-20010627,
744	         June 2001.

746	   [12]  Hoffman, P. and F. Yergeau, "UTF-16, an encoding of ISO 10646",
747	         RFC 2781, February 2000.

749	   [13]  Rose, M., "Writing I-Ds and RFCs using XML", RFC 2629,
750	         June 1999.

752	Appendix A.  Where to send Comments

754	   Please send all comments and questions concerning this document to
755	   Erik Wilde.

757	Appendix B.  Acknowledgements

759	   This document has been prepared using the IETF document DTD described
760	   in RFC 2629 [13].

762	   Thanks for comments and suggestions provided by Marcel Baschnagel,
763	   John Cowan, Martin Duerst, Benja Fallenstein, Sandro Hawke, Dan Kohn,
764	   and Henrik Levkowetz.

766	Author's Address

768	   Erik Wilde
769	   ETH Zurich
770	   ETH-Zentrum
771	   8092 Zurich
772	   Switzerland

774	   Phone: +41-44-6325132
775	   Email: net.dret@dret.net
776	   URI:   http://dret.net/netdret/

778	Intellectual Property Statement

780	   The IETF takes no position regarding the validity or scope of any
781	   Intellectual Property Rights or other rights that might be claimed to
782	   pertain to the implementation or use of the technology described in
783	   this document or the extent to which any license under such rights
784	   might or might not be available; nor does it represent that it has
785	   made any independent effort to identify any such rights.  Information
786	   on the procedures with respect to rights in RFC documents can be
787	   found in BCP 78 and BCP 79.

789	   Copies of IPR disclosures made to the IETF Secretariat and any
790	   assurances of licenses to be made available, or the result of an
791	   attempt made to obtain a general license or permission for the use of
792	   such proprietary rights by implementers or users of this
793	   specification can be obtained from the IETF on-line IPR repository at
794	   http://www.ietf.org/ipr.

796	   The IETF invites any interested party to bring to its attention any
797	   copyrights, patents or patent applications, or other proprietary
798	   rights that may cover technology that may be required to implement
799	   this standard.  Please address the information to the IETF at
800	   ietf-ipr@ietf.org.

802	Disclaimer of Validity

804	   This document and the information contained herein are provided on an
805	   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
806	   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
807	   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
808	   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
809	   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
810	   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

812	Copyright Statement

814	   Copyright (C) The Internet Society (2006).  This document is subject
815	   to the rights, licenses and restrictions contained in BCP 78, and
816	   except as set forth therein, the authors retain all their rights.

818	Acknowledgment

820	   Funding for the RFC Editor function is currently provided by the
821	   Internet Society.