idnits 2.17.1 

draft-wilde-text-fragment-06.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** It looks like you're using RFC 3978 boilerplate.  You should update this
     to the boilerplate described in the IETF Trust License Policy document
     (see https://trustee.ietf.org/license-info), which is required now.

  -- Found old boilerplate from RFC 3978, Section 5.1 on line 16.

  -- Found old boilerplate from RFC 3978, Section 5.5 on line 898.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 909.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 916.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 922.

  ** This document has an original RFC 3978 Section 5.4 Copyright Line,
     instead of the newer IETF Trust Copyright according to RFC 4748.

  ** This document has an original RFC 3978 Section 5.5 Disclaimer, instead
     of the newer disclaimer which includes the IETF Trust according to RFC
     4748.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == It seems as if not all pages are separated by form feeds - found 0 form
     feeds but 20 pages


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  -- The draft header indicates that this document updates RFC2046, but the
     abstract doesn't seem to mention this, which it should.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not
     match the current year

  == The document seems to lack the recommended RFC 2119 boilerplate, even if
     it appears to use RFC 2119 keywords. 

     (The document does seem to have the reference to RFC 2119 which the
     ID-Checklist requires).
     (Using the creation date from RFC2046, updated by this document, for
     RFC5378 checks: 1995-04-14)

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (Jan 17, 2007) is 6308 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Possible downref: Non-RFC (?) normative reference: ref. '6'

  ** Obsolete normative reference: RFC 4234 (ref. '7') (Obsoleted by RFC 5234)

  ** Downref: Normative reference to an Informational RFC: RFC 1321 (ref.
     '10')

  -- Obsolete informational reference (is this intentional?): RFC 4288 (ref.
     '12') (Obsoleted by RFC 6838)


     Summary: 5 errors (**), 0 flaws (~~), 3 warnings (==), 10 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Network Working Group                                           E. Wilde
3	Internet-Draft                                               UC Berkeley
4	Updates: 2046 (if approved)                                    M. Duerst
5	Intended status: Standards Track                Aoyama Gakuin University
6	Expires: July 21, 2007                                      Jan 17, 2007

8	         URI Fragment Identifiers for the text/plain Media Type
9	                      draft-wilde-text-fragment-06

11	Status of this Memo

13	   By submitting this Internet-Draft, each author represents that any
14	   applicable patent or other IPR claims of which he or she is aware
15	   have been or will be disclosed, and any of which he or she becomes
16	   aware will be disclosed, in accordance with Section 6 of BCP 79.

18	   Internet-Drafts are working documents of the Internet Engineering
19	   Task Force (IETF), its areas, and its working groups.  Note that
20	   other groups may also distribute working documents as Internet-
21	   Drafts.

23	   Internet-Drafts are draft documents valid for a maximum of six months
24	   and may be updated, replaced, or obsoleted by other documents at any
25	   time.  It is inappropriate to use Internet-Drafts as reference
26	   material or to cite them other than as "work in progress."

28	   The list of current Internet-Drafts can be accessed at
29	   http://www.ietf.org/ietf/1id-abstracts.txt.

31	   The list of Internet-Draft Shadow Directories can be accessed at
32	   http://www.ietf.org/shadow.html.

34	   This Internet-Draft will expire on July 21, 2007.

36	Copyright Notice

38	   Copyright (C) The Internet Society (2007).

40	Abstract

42	   This memo defines URI fragment identifiers for text/plain MIME
43	   entities.  These fragment identifiers make it possible to refer to
44	   parts of a text/plain MIME entity, identified by character count or
45	   range, line count or range, or a regular expression.  These
46	   identification methods can be combined to identify more than one sub-
47	   resource of a text/plain MIME entity.  Fragment identifiers may also
48	   contain hash information to make them more robust.

50	Table of Contents

52	   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  3
53	     1.1.  What is text/plain?  . . . . . . . . . . . . . . . . . . .  3
54	     1.2.  What is a URI Fragment Identifier? . . . . . . . . . . . .  3
55	     1.3.  Why text/plain Fragment Identifiers? . . . . . . . . . . .  4
56	     1.4.  Incremental Deployment . . . . . . . . . . . . . . . . . .  5
57	     1.5.  Notation Used in this Memo . . . . . . . . . . . . . . . .  5
58	   2.  Fragment Identification Methods  . . . . . . . . . . . . . . .  5
59	     2.1.  Fragment Identification Principles . . . . . . . . . . . .  6
60	       2.1.1.  Positions and Ranges . . . . . . . . . . . . . . . . .  6
61	       2.1.2.  Characters and Lines . . . . . . . . . . . . . . . . .  7
62	     2.2.  Combining the Principles . . . . . . . . . . . . . . . . .  7
63	       2.2.1.  Character Position . . . . . . . . . . . . . . . . . .  7
64	       2.2.2.  Character Range  . . . . . . . . . . . . . . . . . . .  7
65	       2.2.3.  Line Position  . . . . . . . . . . . . . . . . . . . .  8
66	       2.2.4.  Line Range . . . . . . . . . . . . . . . . . . . . . .  8
67	     2.3.  Regular Expressions  . . . . . . . . . . . . . . . . . . .  8
68	     2.4.  Combining Fragment Identification Scheme Parts . . . . . .  8
69	     2.5.  Fragment Identifier Robustness . . . . . . . . . . . . . .  9
70	   3.  Fragment Identification Syntax . . . . . . . . . . . . . . . .  9
71	     3.1.  Non-ASCII Characters in Regular Expressions  . . . . . . . 10
72	     3.2.  Hash Sums  . . . . . . . . . . . . . . . . . . . . . . . . 11
73	   4.  Fragment Identifier Processing . . . . . . . . . . . . . . . . 11
74	     4.1.  Handling of Line Endings in text/plain MIME Entities . . . 11
75	     4.2.  Handling of Position Values  . . . . . . . . . . . . . . . 12
76	     4.3.  Handling of Hash Sums  . . . . . . . . . . . . . . . . . . 12
77	     4.4.  Syntax Errors in Fragment Identifiers  . . . . . . . . . . 12
78	   5.  Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
79	   6.  IANA Considerations  . . . . . . . . . . . . . . . . . . . . . 14
80	   7.  Security Considerations  . . . . . . . . . . . . . . . . . . . 14
81	   8.  Change Log . . . . . . . . . . . . . . . . . . . . . . . . . . 15
82	     8.1.  From -05 to -06  . . . . . . . . . . . . . . . . . . . . . 15
83	     8.2.  From -04 to -05  . . . . . . . . . . . . . . . . . . . . . 16
84	     8.3.  From -03 to -04  . . . . . . . . . . . . . . . . . . . . . 16
85	     8.4.  From -02 to -03  . . . . . . . . . . . . . . . . . . . . . 17
86	     8.5.  From -01 to -02  . . . . . . . . . . . . . . . . . . . . . 17
87	     8.6.  From -00 to -01  . . . . . . . . . . . . . . . . . . . . . 17
88	   9.  References . . . . . . . . . . . . . . . . . . . . . . . . . . 18
89	     9.1.  Normative References . . . . . . . . . . . . . . . . . . . 18
90	     9.2.  Non-Normative References . . . . . . . . . . . . . . . . . 18
91	   Appendix A.  Acknowledgements  . . . . . . . . . . . . . . . . . . 19
92	   Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 19
93	   Intellectual Property and Copyright Statements . . . . . . . . . . 20

95	1.  Introduction

97	   This memo updates the text/plain MIME type defined in RFC 2046 [1] by
98	   defining URI fragment identifiers for text/plain MIME entities.  This
99	   makes it possible to refer to parts of a text/plain MIME entity.
100	   Such parts can be identified by character count or range, line count
101	   or range, or a regular expression.  Hash information can be added to
102	   a fragment identifier to make it more robust.

104	   This section gives an introduction to the general concepts of text/
105	   plain MIME entities and URI fragment identifiers, and discusses the
106	   need for fragment identifiers for text/plain and deployment issues.
107	   Section 2 discusses the principles and methods on which this memo is
108	   based.  Section 3 gives the syntax, and Section 4 discusses
109	   processing of text/plain fragment identifiers.  Section 5 shows some
110	   examples.

112	1.1.  What is text/plain?

114	   Internet Media Types as defined in RFC 2045 [2] and RFC 2046 [1] are
115	   used to identify different types and sub-types of media.  RFC 2046
116	   [1] and RFC 3676 [3] specify the text/plain media type, which is used
117	   for simple, unformatted text.  Quoting from RFC 2046 [1]: "Plain text
118	   does not provide for or allow formatting commands, font attribute
119	   specifications, processing instructions, interpretation directives,
120	   or content markup.  Plain text is seen simply as a linear sequence of
121	   characters, possibly interrupted by line breaks or page breaks."

123	   The text/plain media type does not restrict the character encoding;
124	   any character encoding may be used.  In the absence of an explicit
125	   character encoding declaration, US-ASCII is assumed as the default
126	   character encoding.  This variability of the character encoding makes
127	   it impossible to count characters in a text/plain MIME entity without
128	   taking the character encoding into account, because there are many
129	   character encodings using more than one octet per character.

131	   The biggest advantage of text/plain MIME entities is their ease of
132	   use and their portability among different platforms.  As long as they
133	   use popular character encodings (such as US-ASCII or UTF-8), they can
134	   be displayed and processed on virtually every computer system.  The
135	   only remaining interoperability issue is the representation of line
136	   endings, which is discussed in Section 4.1.

138	1.2.  What is a URI Fragment Identifier?

140	   URIs are the identification mechanism for resources on the Web. The
141	   URI syntax specified in RFC 3986 [4] includes as part of a URI a
142	   fragment identifier, separated by a number sign ('#').  The fragment
143	   identifier consists of additional reference information to be
144	   interpreted by the user agent after the retrieval action has been
145	   successfully completed.  The semantics of a fragment identifier is a
146	   property of the data resulting from a retrieval action, regardless of
147	   the type of URI used in the reference.  Therefore, the format and
148	   interpretation of fragment identifiers is dependent on the media type
149	   of the retrieval result.

151	   The most popular fragment identifier is defined for text/html
152	   (defined in RFC 2854 [11]), and makes it possible to refer to a
153	   specific element (identified by the value of a 'name' or 'id'
154	   attribute) of an HTML document.

156	1.3.  Why text/plain Fragment Identifiers?

158	   Referring to specific parts of a resource can be very useful, because
159	   it enables users and applications to create more specific references.
160	   Rather than pointing to a whole resource, users can create references
161	   to the part they really are interested in or want to talk about.
162	   Even though it is suggested that fragment identification methods are
163	   specified in a media type's MIME registration (see [12]), many media
164	   types do not have fragment identification methods associated with
165	   them.

167	   Fragment identifiers are only useful if supported by the client,
168	   because they are only interpreted by the client.  Therefore, a new
169	   fragment identification method will require some time to be adopted
170	   by clients, and older clients will not support it.  However, because
171	   the URI still works even if the fragment identifier is not supported
172	   (the resource is retrieved, but the fragment identifier is not
173	   interpreted), rapid adoption is not highly critical to ensure the
174	   success of a new fragment identification method.

176	   Fragment identifiers for text/plain as defined in this memo make it
177	   possible to refer to specific parts of a text/plain MIME entity,
178	   using concepts of positions and ranges, which may be applied to
179	   characters and lines.  They also support locating a fragment by using
180	   a regular expression for searching for a specific character sequence.
181	   Thus, text/plain fragment identifiers enable users to exchange
182	   information more specifically, thereby reducing time and effort that
183	   is necessary to manually search for the relevant part of a text/plain
184	   MIME entity.

186	   The text/plain format does not support the embedding of links, so in
187	   normal environments, text/plain resources can only serve as targets
188	   for links, and not as sources.  However, when combining the text/
189	   plain fragment identifiers specified in this memo with out-of-line
190	   linking mechanisms such as XLink [13], it is possible to "embed" link
191	   sources into text/plain resources.  Thus, the text/plain fragment
192	   identifiers specified in this memo open a path for text/plain files
193	   to become fully integrated resources in hypermedia systems such as
194	   the Web.

196	1.4.  Incremental Deployment

198	   As long as support for text/plain fragment identifiers is not
199	   implemented everywhere, it is important to consider the implications
200	   of incremental deployment.  Clients (for example, Web browsers) not
201	   supporting the text/plain fragment identifier described in this memo
202	   will work with URI references to text/plain MIME entities, but they
203	   will fail to locate the sub-resource identified by the fragment
204	   identifier.  This is a reasonable fallback behavior, and in general
205	   users should take into account the possibility that a program
206	   interpreting a given URI will fail to interpret the fragment
207	   identifier part.  Since fragment identifier evaluation is local to
208	   the client (and happens after retrieving the MIME entity), there is
209	   no way for a server to determine whether a requesting client is using
210	   a URI containing a fragment identifier.

212	1.5.  Notation Used in this Memo

214	   The capitalized key words "MUST", "MUST NOT", "REQUIRED", "SHALL",
215	   "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and
216	   "OPTIONAL" in this document are to be interpreted as described in RFC
217	   2119 [5].

219	2.  Fragment Identification Methods

221	   The identification of fragments of text/plain MIME entities can be
222	   based on different foundations.  Since it is not possible to insert
223	   explicit, invisible identifiers into a text/plain MIME entity (as for
224	   example used in HTML documents, implemented through dedicated
225	   attributes), fragment identification has to rely on certain inherent
226	   properties of the MIME entity.  This memo specifies fragment
227	   identification using six different methods, which are character
228	   positions and ranges, line positions and ranges, regular expression
229	   matching, and a mechanism for improving the robustness of fragment
230	   identifiers (entity hashes).

232	   When interpreting character or line numbers, implementations MUST
233	   take the character encoding of the MIME entity into account, because
234	   character count and octet count may differ for the character encoding
235	   being used.  For example, a MIME entity using UTF-16 encoding (as
236	   specified in RFC 2781 [14]) uses two octets per character in most
237	   cases, and sometimes four octets per character.  It can also have a
238	   leading BOM (Byte-Order Mark), which does not count as a character
239	   and thus also affects the mapping from a simple octet count to a
240	   character count.
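
   As a non-normative illustration of why octet count and character
   count differ, the following Python sketch decodes an entity body and
   counts its code points.  The function name character_count and the
   sample text are assumptions made for this example only, and line
   endings are not given any special treatment here.

```python
def character_count(octets: bytes, charset: str) -> int:
    """Count the characters (code points) of an entity body."""
    text = octets.decode(charset)
    # Some decoders leave the BOM (U+FEFF) in the decoded text; it is not
    # part of the textual content, so it is dropped before counting.
    if text.startswith("\ufeff"):
        text = text[1:]
    return len(text)


# Seven characters, the last one outside the Basic Multilingual Plane.
sample = "na\u00efve \U0001d11e"
body = sample.encode("utf-16")  # Python's "utf-16" codec prepends a BOM

octet_count = len(body)                    # 2 (BOM) + 6 * 2 + 4 = 18
char_count = character_count(body, "utf-16")  # 7
```

   A client that counted octets instead would place character positions
   in the wrong spot for any entity not encoded in a single-octet
   character encoding.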

242	2.1.  Fragment Identification Principles

244	   Fragment identification can be done using regular expressions or
245	   combining two orthogonal principles, which are positions and ranges,
246	   and characters and lines.  This section describes the principles
247	   themselves, while Section 2.2 describes the combination of the
248	   principles.

250	2.1.1.  Positions and Ranges

252	   A position does not identify an actual fragment of the MIME entity,
253	   but a position inside the MIME entity, which can be regarded as a
254	   fragment of zero length.  The use case for positions is to provide
255	   pointers for applications which may use them to implement
256	   functionalities such as "insert some text here", which needs a
257	   position rather than a fragment.  Positions are counted from zero,
258	   position zero being before the first character or line of a text/
259	   plain MIME entity.  Thus a text/plain MIME entity having one
260	   character has two positions, one before the first character (position
261	   0), and one after the first character (position 1).
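
   The "insert some text here" use case can be sketched as follows.
   This is a minimal, non-normative Python illustration; the function
   name insert_at is hypothetical, and mapping an over-large position to
   the end of the text is an assumption about the processing rules.

```python
def insert_at(text: str, position: int, inserted: str) -> str:
    """Insert a string at a zero-length position between characters."""
    # Assumption: a position beyond the end is treated as the end.
    position = min(position, len(text))
    return text[:position] + inserted + text[position:]


one_char = "a"
# A one-character entity has two positions: 0 (before) and 1 (after).
positions = list(range(len(one_char) + 1))
before = insert_at(one_char, 0, "X")  # "Xa"
after = insert_at(one_char, 1, "X")   # "aX"
```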

263	   Since positions are fragments of length zero, applications SHOULD use
264	   other methods than highlighting to indicate positions, the most
265	   obvious way being the positioning of a cursor (if the application
266	   supports the concept of a cursor).

268	   Ranges, on the other hand, identify fragments of a MIME entity that
269	   have a length that may be greater than zero.  As a general principle
270	   for ranges, they specify both a lower and an upper bound.  The start
271	   or the end of a range specification may be omitted, defaulting to the
272	   first or last position of the MIME entity, respectively.  The end of a
273	   range must have a value greater than or equal to the start.  A range
274	   with identical start and end is legal, and identifies a range of
275	   length 0, which is equivalent to a position.

277	   Applications that support a concept such as highlighting SHOULD use
278	   such a concept to indicate fragments of length greater than zero to
279	   the user.

281	   For positions and ranges it is implicitly assumed that if a number is
282	   greater than the actual number of elements in the MIME entity, then
283	   it is referring to the last element of the MIME entity (see Section 4
284	   for details).
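
   The range rules above (omitted bounds, end not smaller than start,
   and over-large numbers) can be sketched in Python as shown below.
   This is an illustration under stated assumptions, not a normative
   algorithm: the name resolve_range is hypothetical, None stands for an
   omitted bound, and over-large values are mapped to the last position.

```python
from typing import Optional, Tuple


def resolve_range(start: Optional[int], end: Optional[int],
                  length: int) -> Tuple[int, int]:
    """Turn a possibly open range into concrete (start, end) positions."""
    if start is not None and end is not None and end < start:
        raise ValueError("range end must not be smaller than range start")
    # Omitted bounds default to the first or last position.
    lo = 0 if start is None else min(start, length)
    hi = length if end is None else min(end, length)
    return lo, hi


open_start = resolve_range(None, 5, 10)  # (0, 5)
open_end = resolve_range(3, None, 10)    # (3, 10)
empty = resolve_range(4, 4, 10)          # (4, 4), equivalent to a position
clamped = resolve_range(2, 99, 10)       # (2, 10)
```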

286	2.1.2.  Characters and Lines

288	   The concept of positions and ranges can be applied to characters or
289	   lines.  In both cases, positions indicate points between entities,
290	   while ranges identify zero or more entities by indicating positions.

292	   Character positions are numbered starting with zero (ignoring initial
293	   BOM marks or similar concepts that are not part of the actual textual
294	   content of a text/plain MIME entity), and counting each character
295	   separately, with the exception of line endings, which are always
296	   counted as one character (see Section 4.1 for details).

298	   Line positions are numbered starting with zero (with line position
299	   zero always being identical with character position zero), with
300	   Section 4.1 describing how line endings are identified.  Fragments
301	   identified by lines include the line endings, so applications
302	   identifying line-based fragments MUST include the line endings in the
303	   fragment identification they are using (e.g., the highlighted
304	   selection).  If a MIME entity does not contain any line endings, then
305	   it consists of a single (the first) line.
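
   A line-based fragment that includes its line endings might be
   extracted as in the following Python sketch.  The helper names
   split_lines and line_range are hypothetical.  The sketch assumes that
   the text has already been decoded and that line endings have been
   normalized to LF, and it treats a final LF as terminating the last
   line rather than starting an empty one, which is an interpretation
   made for this example.

```python
from typing import List


def split_lines(text: str) -> List[str]:
    """Split LF-normalized text into lines that keep their line endings."""
    parts = text.split("\n")
    lines = [part + "\n" for part in parts[:-1]]
    # Text without any line ending still consists of a single line.
    if parts[-1] or not lines:
        lines.append(parts[-1])
    return lines


def line_range(text: str, start: int, end: int) -> str:
    """Return the fragment between two line positions, endings included."""
    return "".join(split_lines(text)[start:end])


text = "one\ntwo\nthree\n"
second_line = line_range(text, 1, 2)  # "two\n", line ending included
```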

307	2.2.  Combining the Principles

309	   In the following sections, the principles described in the preceding
310	   section (positions/ranges and characters/lines) are combined,
311	   resulting in four use cases.  The fragment identifier syntax,
312	   described in detail in Section 3, uses various schemes for different
313	   purposes.

315	2.2.1.  Character Position

317	   To identify a character position (i.e., a fragment of length zero
318	   between two characters), the 'char' scheme followed by a single
319	   number is used.  Rather than identifying a fragment consisting of a
320	   number of characters, this method identifies a position between two
321	   characters (or before the first or after the last character).
322	   Character position counting starts with 0, so the character position
323	   before the first character of a text/plain MIME entity has the
324	   character position 0, and a MIME entity containing n distinct
325	   characters has n+1 distinct character positions, the last one having
326	   the character position n.

328	2.2.2.  Character Range

330	   To identify a fragment of one or more characters (a character range),
331	   the 'char' scheme followed by a range specification is used.  A
332	   character range is a consecutive region of the MIME entity that
333	   extends from the starting character position of the range to the
334	   ending character position of the range.

336	2.2.3.  Line Position

338	   To identify a line position (i.e., a fragment of length zero between
339	   two lines), the 'line' scheme followed by a single number is used.
340	   Rather than identifying a fragment consisting of a number of lines,
341	   this method identifies a position between two lines (or before the
342	   first or after the last line).  Line position counting starts with 0,
343	   so the line position before the first line of a text/plain MIME
344	   entity has the line position 0, and a MIME entity containing n
345	   distinct lines has n+1 distinct line positions, the last one having
346	   the line position n.

348	2.2.4.  Line Range

350	   To identify a fragment of one or more lines (a line range), the
351	   'line' scheme followed by a range specification is used.  A line
352	   range is a consecutive region of the MIME entity that extends from
353	   the starting line position of the range to the ending line position
354	   of the range.

356	2.3.  Regular Expressions

358	   A common problem with fragment identifiers is their robustness (to
359	   changes in the MIME entity), and character and line counts can break
360	   very easily.  A more robust way of identifying a fragment is by
361	   searching for a specific pattern.  Using the 'match' scheme, it is
362	   possible to use a Basic Regular Expression (BRE) as defined by ISO
363	   9945-2 [6] (the POSIX standard) as a fragment identifier.  For
364	   another way of making fragment identifiers more robust, see
365	   Section 2.5.

367	2.4.  Combining Fragment Identification Scheme Parts

369	   In most cases, a fragment identifier will consist of only one
370	   fragment identification scheme part.  However, by concatenating them,
371	   separated with a semicolon, it is possible to use several fragment
372	   identification scheme parts in a fragment identifier.  The whole
373	   fragment identifier refers to the union of all fragments of the text/
374	   plain MIME entity identified by the individual fragment
375	   identification scheme parts.  In this way, it is possible to identify
376	   disjoint ranges, such as multiple line ranges.
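
   Taking the union of several identified fragments can be illustrated
   with the following Python sketch, which merges overlapping or
   adjacent ranges so that each region would be marked only once.  The
   name union_of_ranges and the representation of fragments as (start,
   end) position pairs are assumptions made for this example.

```python
from typing import Iterable, List, Tuple


def union_of_ranges(ranges: Iterable[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Merge (start, end) ranges into sorted, non-overlapping ranges."""
    merged: List[Tuple[int, int]] = []
    for lo, hi in sorted(ranges):
        if merged and lo <= merged[-1][1]:
            # Overlaps or touches the previous range: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged


# Three scheme parts, two of which overlap.
fragments = union_of_ranges([(10, 20), (0, 5), (15, 30)])  # [(0, 5), (10, 30)]
```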

378	   It should be noted that regular expressions by themselves may
379	   identify disjoint fragments, which is true in any case where the
380	   regular expression matches more than one occurrence in the MIME
381	   entity.

383	   Since disjoint fragments can be identified, implementations SHOULD
384	   make sure that these fragments are appropriately marked, for example
385	   by highlighting the fragment (rather than only scrolling to some
386	   line, which only identifies a single position in the MIME entity).
387	   If an implementation cannot mark disjoint fragments, it MAY resort
388	   to marking only the first of the disjoint fragments.  However, the
389	   exact method of how implementations deal with disjoint fragments
390	   depends on the application and interface, and is beyond the scope of
391	   this memo.

393	2.5.  Fragment Identifier Robustness

395	   While regular expressions (as described in Section 2.3) may make
396	   fragment identifiers more robust than character or line counts, it is
397	   still possible that modifications of the resource will break the
398	   fragment identifier.  If applications want to create more robust
399	   fragment identifiers, they may do so by adding hash sums to fragment
400	   identifiers.  These hash sums are used to detect a change in the
401	   resource.  Applications can then warn users about the possibility
402	   that a fragment identifier might have been broken by a modification
403	   of the resource.

405	   Since fragment identifiers are interpreted by clients, hash sums are
406	   defined on MIME entities rather than the resource itself, and as such
407	   are specific to a certain representation of the resource, in case of
408	   text/plain resources the character encoding of the MIME entity.

410	   Hash sums may specify the character encoding that has been used when
411	   creating the hash sums, and if such a specification is present,
412	   clients MUST check whether the character encoding specified for the
413	   hash sum and the character encoding of the retrieved MIME entity are
414	   equal, and clients MUST NOT check the hash sum if these values
415	   differ.  However, clients MAY choose to transcode the retrieved MIME
416	   entity in the case of differing character encodings, and after doing
417	   so, check the hash sum.  Please note that this method is inherently
418	   unreliable, because certain characters or character sequences may
419	   have been lost or normalized due to restrictions in one of the
420	   character encodings used.
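
   The character encoding check that precedes hash sum verification
   could look like the following Python sketch.  It is non-normative:
   the name may_check_hash is hypothetical, comparing encodings through
   Python's codec alias table is an assumption about what "equal" means,
   and the optional transcoding path is not shown.

```python
import codecs
from typing import Optional


def may_check_hash(hash_charset: Optional[str],
                   entity_charset: Optional[str]) -> bool:
    """Decide whether a hash sum may be checked against the entity."""
    if hash_charset is None:
        # No character encoding was given with the hash sum.
        return True
    try:
        declared = codecs.lookup(hash_charset).name
        # US-ASCII is the default when the entity declares no encoding.
        actual = codecs.lookup(entity_charset or "us-ascii").name
    except LookupError:
        # Unknown encoding name: the two values cannot be shown equal.
        return False
    return declared == actual
```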

422	3.  Fragment Identification Syntax

424	   The syntax for the fragment identifiers is straightforward.  The
425	   syntax defines four schemes, 'char', 'line', 'match', and hash (which
426	   can either be 'length' or 'md5').  The 'char' and 'line' schemes can
427	   be used in two different variants, either the position variant (with
428	   a single number), or the range variant (with two comma-separated
429	   numbers).  The 'match' scheme has a regular expression as its
430	   parameter, which must be specified as a string with escaped
431	   semicolons (because the semicolon is used to concatenate multiple
432	   fragment identification scheme parts).  The hash scheme can either
433	   use the 'length' or the 'md5' scheme to specify a hash value.

435	   The following syntax definition uses ABNF as defined in RFC 4234 [7],
436	   including the rules DIGIT and HEXDIG.

438	  text-fragment =  text-scheme 0*( ";" text-scheme) 0*( ";" hash-scheme)
439	  text-scheme   =  ( char-scheme / line-scheme / match-scheme )
440	  hash-scheme   =  ( length-scheme / md5-scheme ) [ "," charenc ]
441	  char-scheme   =  %x63.68.61.72 "=" ( position / range )  ; "char="
442	  line-scheme   =  %x6C.69.6E.65 "=" ( position / range )  ; "line="
443	  match-scheme  =  %x6D.61.74.63.68 "=" regex  ; "match="
444	  position      =  number
445	  range         =  (position "," [ position ]) / ("," position )
446	  number        =  1*( DIGIT )
447	  regex         =  StringWithEscapedSemicolon
448	  length-scheme =  %x6C.65.6E.67.74.68 "=" number  ; "length="
449	  md5-scheme    =  %x6D.64 "5=" md5-value  ; "md5="
450	  md5-value     =  32HEXDIG
451	  charenc       =  StringWithEscapedSemicolon
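
   As an illustration of the position and range productions, the
   following Python sketch parses a single 'char' or 'line' scheme part.
   It is a simplified reading of the grammar above, not a complete
   parser; the name parse_char_or_line and the tuple layout of the
   result are assumptions made for this example, and None stands for an
   omitted range bound.

```python
import re
from typing import Tuple

_NUMBER = re.compile(r"[0-9]+")
_RANGE = re.compile(r"([0-9]*),([0-9]*)")


def parse_char_or_line(part: str) -> Tuple:
    """Parse 'char=...' or 'line=...' into a position or range tuple."""
    scheme, sep, value = part.partition("=")
    if sep != "=" or scheme not in ("char", "line"):
        raise ValueError("not a char or line scheme part: %r" % part)
    if _NUMBER.fullmatch(value):
        return (scheme, "position", int(value))
    match = _RANGE.fullmatch(value)
    # A range needs at least one bound; a lone "," is a syntax error.
    if match and (match.group(1) or match.group(2)):
        start = int(match.group(1)) if match.group(1) else None
        end = int(match.group(2)) if match.group(2) else None
        return (scheme, "range", start, end)
    raise ValueError("malformed position or range: %r" % value)


position = parse_char_or_line("char=100")     # ("char", "position", 100)
full_range = parse_char_or_line("line=10,20")  # ("line", "range", 10, 20)
open_range = parse_char_or_line("line=,5")     # ("line", "range", None, 5)
```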

453	   The StringWithEscapedSemicolon is a string where all characters may
454	   appear literally (except the characters which are required by the URI
455	   syntax to be escaped), with the exception of a semicolon.  A
456	   semicolon that is part of the regular expression must be escaped with
457	   a leading backslash, and implementations MUST properly interpret
458	   regular expressions, dereferencing all escape mechanisms that apply
459	   (i.e., any escaping present due to the context of the URI, semicolon
460	   escaping, URI percent-encoding, and BRE escaping, in that order).
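
   The semicolon escaping step can be sketched in Python as follows.
   This is a non-normative illustration: the name split_scheme_parts is
   hypothetical, the input is assumed to be the fragment identifier
   after any context-dependent escaping has been removed, and the
   handling of a backslash that does not precede a semicolon (it is left
   untouched here) is an assumption.

```python
from typing import List


def split_scheme_parts(fragment: str) -> List[str]:
    """Split a fragment identifier at unescaped semicolons."""
    parts: List[str] = []
    current: List[str] = []
    i = 0
    while i < len(fragment):
        ch = fragment[i]
        if ch == "\\" and fragment[i + 1:i + 2] == ";":
            # Escaped semicolon: keep the semicolon, drop the backslash.
            current.append(";")
            i += 2
        elif ch == ";":
            # Unescaped semicolon: end of one scheme part.
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    parts.append("".join(current))
    return parts


parts = split_scheme_parts(r"match=a\;b;line=3")  # ["match=a;b", "line=3"]
```

   Percent-decoding and BRE interpretation would follow this step, in
   the order given above.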

462	3.1.  Non-ASCII Characters in Regular Expressions

464	   RFC 3986 [4] only allows a subset of ASCII as characters in URIs.
465	   Non-ASCII octets can be included using percent-encoding.  Non-ASCII
466	   characters in regular expressions MUST be encoded using UTF-8 [8]
467	   before applying percent-encoding, and MUST be interpreted using UTF-8
468	   after resolving percent-encoding.  Therefore, using Internationalized
469	   Resource Identifiers (IRIs) [9] it is possible to use non-ASCII
470	   characters directly in regular expressions.  Implementations that
471	   support plain text fragment identifiers for documents not encoded in
472	   US-ASCII SHOULD support regular expressions with non-ASCII
473	   characters, or MUST ignore such regular expressions.
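
   The UTF-8 and percent-encoding round trip can be illustrated with
   Python's urllib.parse module, as in the sketch below.  The function
   names are hypothetical, and the sketch percent-encodes more
   characters than strictly required, which is an assumption made to
   keep the example short.

```python
from urllib.parse import quote, unquote


def encode_regex_for_uri(regex: str) -> str:
    """Encode as UTF-8, then percent-encode the resulting octets."""
    return quote(regex, safe="", encoding="utf-8")


def decode_regex_from_uri(encoded: str) -> str:
    """Resolve percent-encoding, then interpret the octets as UTF-8."""
    return unquote(encoded, encoding="utf-8", errors="strict")


encoded = encode_regex_for_uri("Stra\u00dfe")  # "Stra%C3%9Fe"
decoded = decode_regex_from_uri(encoded)       # "Stra\u00dfe"
```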

475	3.2.  Hash Sums

477	   A hash sum can either specify a MIME entity's length, or its MD5
478	   fingerprint.  In both cases, it can optionally specify the character
479	   encoding which had been used when calculating the hash sum, so that
480	   clients interpreting the fragment identifier may check whether they
481	   are using the same character encoding for their calculations.  For
482	   lengths, the character encoding can be necessary because it can
483	   influence the character count.  As an example, Unicode includes
484	   precomposed characters for writing Vietnamese, but in the windows-
485	   1258 encoding, also used for writing Vietnamese, some characters have
486	   to be encoded with separate diacritics, which means that two
487	   characters are counted.  Applying Unicode terminology, this means
488	   that the length of a text/plain MIME entity is computed based on its
489	   "code points".  For MD5 fingerprints, the character encoding is
490	   necessary because the MD5 algorithm works on the binary
491	   representation of the text/plain resource.

493	   The length of a text/plain MIME entity is calculated by using the
494	   principles defined in Section 2.1.2.  The MD5 fingerprint of a text/
495	   plain MIME entity is calculated by using the algorithm presented in
496	   [10], encoding the result in 32 hexadecimal digits (using uppercase
497	   or lowercase letters) as a representation of the 128 bits which are
498	   the result of the MD5 algorithm.
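
   Checking an MD5 fingerprint can be sketched with Python's hashlib
   module as shown below.  The sketch is non-normative and the function
   name md5_matches is hypothetical.  It hashes the octets of the entity
   as retrieved and compares hexadecimal digits without regard to case,
   consistent with the md5-value rule (32HEXDIG) in Section 3.

```python
import hashlib


def md5_matches(octets: bytes, expected_hex: str) -> bool:
    """Compare the MD5 fingerprint of an entity body with a given value."""
    # hexdigest() returns 32 lowercase hexadecimal digits (128 bits).
    return hashlib.md5(octets).hexdigest() == expected_hex.lower()


unchanged = md5_matches(b"abc", "900150983CD24FB0D6963F7D28E17F72")  # True
modified = md5_matches(b"abd", "900150983CD24FB0D6963F7D28E17F72")   # False
```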

500	4.  Fragment Identifier Processing

502	4.1.  Handling of Line Endings in text/plain MIME Entities

504	   In Internet messages, line endings in text/plain MIME entities are
505	   represented by CR+LF character sequences (see RFC 2046 [1] and RFC
506	   3676 [3]).  However, some protocols (such as HTTP) additionally allow
507	   other conventions for line breaks.  Also, some operating systems
508	   store text/plain entities locally with different line endings (in
509	   most cases, Unix uses LF, MacOS uses CR, and Windows uses CR+LF).

511	   Independent of the number of bytes or characters used to represent a
512	   line ending, each line ending MUST be counted as one single
513	   character.  For the purpose of regular expression matching, all
514	   representations of line endings MUST be treated as single LF
515	   characters (matched by \n).  Implementations interpreting text/plain
516	   fragment identifiers MUST take into account the line ending
517	   conventions of the protocols and other contexts that they work in.

519	   As an example, an implementation working in the context of a Web
520	   browser supporting http: URIs has to support the various line ending
521	   conventions permitted by HTTP.  As another example, an implementation
522	   used on local files (e.g. with the file: URI scheme) has to support
523	   the conventions used for local storage.  All implementations SHOULD
524	   support the Internet-wide CR+LF line ending convention, and MAY
525	   support additional conventions not related to the protocols or
526	   systems they work with.

528	   Implementers should be aware of the fact that line endings in plain
529	   text entities can be represented by other characters or character
530	   sequences than CR+LF.  Besides the abovementioned CR and LF, there
531	   are also NEL and CR+NEL.  In general, the encoding of line endings
532	   can also depend on the character encoding of the MIME entity, and
533	   implementations have to take this into account where necessary.
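
   One possible way to treat every line ending as a single LF before
   counting or matching is sketched below.  This is only an
   illustration: the helper name normalize_line_endings is hypothetical,
   and the sketch assumes the entity has already been decoded into a
   Python string and that all five conventions mentioned above (CR+LF,
   CR+NEL, CR, LF, NEL) are to be accepted, which goes beyond what a
   given protocol context may require.

```python
import re

# Two-character sequences are listed first so that CR+LF and CR+NEL are
# consumed as one line ending rather than as two separate ones.
_LINE_ENDING = re.compile("\r\n|\r\x85|\r|\n|\x85")


def normalize_line_endings(text: str) -> str:
    """Replace each recognized line ending with a single LF character."""
    return _LINE_ENDING.sub("\n", text)


sample = "first\r\nsecond\rthird\nfourth"
normalized = normalize_line_endings(sample)
print(normalized.split("\n"))
```

   After normalization, each line ending contributes exactly one
   character to the length, and a regular expression can match it with
   \n regardless of how it was represented in the retrieved entity.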

535	4.2.  Handling of Position Values

537	   If any position value (as a position or as part of a range) is
538	   greater than the length of the actual MIME entity, then it identifies
539	   the last character or line position of the MIME entity.  If the first
540	   position value in a range is not present, then the range extends from
541	   the start of the MIME entity.  If the second position value in a
542	   range is not present, then the range extends to the end of the MIME
543	   entity.  If a range scheme's positions are not properly ordered (i.e.,
544	   the first number is greater than the second), then this scheme part
545	   MUST be ignored.
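
   The rules above could be applied roughly as in the following sketch.
   It is not a definitive implementation: the function name
   resolve_range is hypothetical, None stands for a position value that
   is not present, and the sketch assumes that "not properly ordered"
   means the first position is greater than the second.

```python
from typing import Optional, Tuple


def resolve_range(
    first: Optional[int], second: Optional[int], length: int
) -> Optional[Tuple[int, int]]:
    """Resolve a range against an entity with `length` characters or lines.

    Returns (start, end) positions, or None if the scheme part is ignored.
    """
    # Assumption: a range is misordered when first > second.
    if first is not None and second is not None and first > second:
        return None
    # A missing first position means the start of the entity; a missing
    # second position means the end of the entity.
    start = 0 if first is None else min(first, length)
    end = length if second is None else min(second, length)
    return (start, end)


print(resolve_range(10, 20, 15))
print(resolve_range(None, 1, 5))
print(resolve_range(20, 10, 100))
```

   Positions beyond the end of the entity are clamped to the last
   position, so a range that lies entirely past the end collapses to a
   single position after the last character or line.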

547	4.3.  Handling of Hash Sums

549	   Clients are not required to implement the handling of hash sums, so
550	   they MAY choose to ignore hash sum information altogether.  However,
551	   if they do implement hash sum handling, the following applies:

553	   If a fragment identifier contains a hash sum, and a client retrieves
554	   a MIME entity and detects that the hash sum has changed (observing
555	   the character encoding specification as described in Section 3.2, if
556	   present), then the client SHOULD NOT interpret any other text/plain
557	   fragment identifier scheme part.  A client MAY signal this situation
558	   to the user.
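
   A client that chooses to implement hash sum handling might check the
   retrieved entity along the following lines.  The function name
   hash_sum_matches and its keyword parameters are hypothetical.  The
   sketch assumes the entity has been decoded into a Python string and
   that the MD5 fingerprint is taken over the octets of the text in the
   character encoding named in the fragment identifier.

```python
import hashlib
from typing import Optional


def hash_sum_matches(
    text: str,
    *,
    length: Optional[int] = None,
    md5: Optional[str] = None,
    charset: str = "UTF-8",
) -> bool:
    """Return True if every hash sum that was supplied still matches."""
    if length is not None and len(text) != length:
        return False
    if md5 is not None:
        actual = hashlib.md5(text.encode(charset)).hexdigest()
        # Uppercase and lowercase hexadecimal digits are both permitted.
        if actual != md5.lower():
            return False
    return True


entity = "abc"
if hash_sum_matches(entity, length=3):
    print("hash sum unchanged; other scheme parts may be interpreted")
else:
    print("hash sum changed; other scheme parts should not be interpreted")
```

   When the check fails, the sketch leaves it to the caller to skip the
   remaining scheme parts and, optionally, to notify the user.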

560	4.4.  Syntax Errors in Fragment Identifiers

562	   If a fragment identifier contains a syntax error (i.e., does not
563	   conform to the syntax specified in Section 3), then it MUST be
564	   ignored by clients.  Clients SHOULD NOT make any attempt to correct
565	   or guess fragment identifiers.  Syntax errors MAY be reported by
566	   clients.

568	5.  Examples

570	   The following examples show some usages for the fragment identifiers
571	   defined in this memo.

573	   http://example.com/text.txt#char=100

575	   This URI identifies the position after the 100th character of the
576	   text.txt MIME entity.  It should be noted that it is not clear which
577	   octet(s) of the MIME entity this will be without retrieving the MIME
578	   entity and thus knowing which character encoding it is using (in case
579	   of HTTP, this information will be given in the Content-Type header of
580	   the response).  If the MIME entity has fewer than 100 characters, the
581	   URI identifies the position after the MIME entity's last character.

583	   ftp://example.com/text.txt#line=10,20

585	   This URI identifies lines 11 to 20 of the text.txt MIME entity.  If
586	   the MIME entity has fewer than 11 lines, it identifies the position
587	   after the last line.  If the MIME entity has fewer than 20 but at
588	   least 11 lines, it identifies lines 11 to the last line of the MIME
589	   entity.
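
   Because line positions lie between lines, the range in this example
   maps naturally onto a zero-based slice.  The following sketch
   illustrates this.  The function name select_lines is hypothetical,
   and the sketch assumes that line endings have already been reduced
   to single LF characters and that a trailing LF terminates the last
   line rather than starting an empty one.

```python
from typing import List


def select_lines(text: str, first: int, second: int) -> List[str]:
    """Return the lines between line positions `first` and `second`."""
    lines = text.split("\n")
    # Assumption: a final LF ends the last line instead of adding a line.
    if lines and lines[-1] == "":
        lines.pop()
    # Slicing clamps positions beyond the end of the list automatically.
    return lines[first:second]


document = "\n".join(f"line {n}" for n in range(1, 31))
print(select_lines(document, 10, 20))
```

   For a 30-line entity this yields lines 11 to 20.  For an entity with
   only 15 lines it yields lines 11 to 15, and for an entity with fewer
   than 11 lines it yields an empty selection.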

591	   http://example.com/text.txt#match=searchterm

593	   This URI identifies all occurrences of the regular expression
594	   'searchterm' in the MIME entity, i.e., all occurrences of the string
595	   'searchterm'.  If there is more than one occurrence, then this URI
596	   identifies a disjoint fragment, consisting of all of these
597	   occurrences.  If there is no occurrence of the search term, the URI
598	   does not identify a fragment.
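
   For a purely literal search term such as the one in this example,
   the disjoint fragment can be sketched as a list of character ranges.
   The function name find_occurrences is hypothetical.  Note that
   Python's re module does not implement POSIX basic regular
   expressions, so the sketch escapes the term and handles only literal
   strings; it is not a substitute for a BRE engine.

```python
import re
from typing import List, Tuple


def find_occurrences(text: str, literal: str) -> List[Tuple[int, int]]:
    """Return (start, end) positions of every occurrence of a literal term."""
    # re.escape makes the term literal; this is NOT BRE matching.
    pattern = re.compile(re.escape(literal))
    return [match.span() for match in pattern.finditer(text)]


print(find_occurrences("a searchterm b searchterm", "searchterm"))
print(find_occurrences("nothing here", "searchterm"))
```

   An empty list corresponds to the case in which the URI does not
   identify a fragment.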

600	   ftp://example.com/text.txt#line=,1;match=searchterm

602	   This URI identifies the first line and all occurrences of the regular
603	   expression 'searchterm' in the MIME entity.  If there is an
604	   occurrence of 'searchterm' outside of the first line, then this URI
605	   identifies a disjoint fragment.

607	   http://example.com/text.txt#match=hello\;

609	   This URI identifies all occurrences of the regular expression
610	   'hello;' in the MIME entity.  The semicolon with the leading
611	   backslash has to be interpreted as a literal semicolon inside of the
612	   BRE, treating the '\;' as an escaped ';', so that the actual regular
613	   expression is 'hello;'.  If there is more than one occurrence of this
614	   regular expression, then this URI identifies a disjoint fragment,
615	   consisting of all of these occurrences.

617	   ftp://example.com/text.txt#line=10,20;length=9876,UTF-8

619	   As in the second example, this URI identifies lines 11 to 20 of the
620	   text.txt MIME entity.  The additional length hash sum specifies that
621	   the MIME entity has a length of 9876 characters when encoded in
622	   UTF-8.  If the client supports the length hash sum scheme, it may
623	   test the retrieved MIME entity for its length, but only if the
624	   retrieved MIME entity uses the UTF-8 encoding or has been locally
625	   transcoded into this encoding.  If the length of the retrieved MIME
626	   entity does not match the length specified in the fragment
627	   identifier, the client SHOULD NOT interpret the line part and MAY
628	   signal this to the user.

630	6.  IANA Considerations

632	   Note to RFC Editor: Please change this section to read as follows
633	   after the IANA action has been completed: "IANA has added a reference
634	   to this specification in the Text/Plain Media Type registration."

636	   IANA is requested to update the registration of the MIME Media type
637	   text/plain at http://www.iana.org/assignments/media-types/text/ with
638	   the fragment identifier defined in this memo by adding a reference to
639	   this memo (with the appropriate RFC number once it is known).

641	7.  Security Considerations

643	   Regular expression matching code is notoriously vulnerable to buffer
644	   overflow security holes, so any implementation supporting text/plain
645	   fragment identifiers SHOULD make sure that the code being used has
646	   been tested against buffer overflow attacks.

648	   The fact that software implementing fragment identifiers for plain
649	   text and software not implementing them differs in behavior, and the
650	   fact that different software may show fragments to users in different
651	   ways (in particular for fragments consisting of multiple ranges) can
652	   lead to misunderstandings on the part of users.  Such
653	   misunderstandings might be exploited in a way similar to spoofing or
654	   phishing, although concrete examples of how this might be done are
655	   not currently known.

657	   Implementers and users of fragment identifiers for plain text should
658	   also be aware of the security considerations in RFC 3986 [4] and RFC
659	   3987 [9].

661	8.  Change Log

663	   Note to RFC Editor: Please remove this section before publication.

665	8.1.  From -05 to -06

667	   o  Clarified that this is intended as an update of the text/plain
668	      MIME type registration, in newly added IANA consideration section
669	      and elsewhere.

671	   o  Added normative reference to UTF-8 (STD63/RFC3629).

673	   o  Fixed section about non-ASCII characters in regular expressions to
674	      be more accurate regarding IRIs.

676	   o  Fixed some text about decomposition and Unicode.

678	   o  Clarified that UTF-16 can also use 4 octets per character.

680	   o  Changed ABNF to make sure schemes are case-sensitive (string
681	      literals in ABNF are case-insensitive).

683	   o  Used HEXDIG from RFC 4234, made clear DIGIT and HEXDIG are from
684	      that spec.

686	   o  Specified order of decoding the various escapings.

688	   o  Moved section on line endings to the back, and changed
689	      requirements to be more in line with practice.

691	   o  Added IANA Consideration section.

693	   o  Expanded Security Consideration section.

695	   o  Removed quote from RFC 3986, because the quoted text doesn't
696	      actually exist there anymore; changed text appropriately.

698	   o  Reorganized section two to get rid of one section level.

700	   o  Added overview in introduction, and some glue text here and there.

702	   o  Changed to more IETF-like wording in some instances (e.g. intro to
703	      this section; removing "Compliant software MUST follow this
704	      specification." at the start of the Introduction,...).

706	   o  Removed 'where to send comments' section.

708	   o  Fixed wording in some cases, tried to make shorter sentences and
709	      eliminate parenthesized expressions.

711	   o  Removed acknowledgement for xml2rfc; we are nevertheless very
712	      grateful for this work!

714	8.2.  From -04 to -05

716	   o  Added some explanatory text to the last paragraph of Section 2.5.

718	   o  Added a paragraph about the importance of having fragment
719	      identification capabilities for out-of-line linking methods such
720	      as XLink to Section 1.3.

722	   o  Added explanation of why the charset is important for length hash
723	      sums to Section 3.2.

725	   o  Added text that makes hash sum handling optional and allows
726	      clients to interpret fragment identifiers even if the hash sum did
727	      not match (changed MUST NOT to SHOULD NOT) to Section 4.3.

729	   o  Added example using a length hash sum in Section 5.

731	   o  RFC 2234 (ABNF) has been obsoleted by [7].

733	   o  Removed the "Open Issues" section for preparation of final draft
734	      before submission as RFC.

736	8.3.  From -03 to -04

738	   o  URIs are now defined by RFC 3986 [4], so the text and the
739	      references have been updated.  In particular, RFC3986 defines a
740	      fragment identifier to be part of the URI, whereas in the
741	      obsoleted RFC 2396 URI specification, it was not part of a URI as
742	      such, but of a "URI reference".

744	   o  IRIs are now defined by RFC 3987 [9], so the text and the
745	      references have been updated.

747	   o  Changed IPR clause from RFC 3667 to RFC 3978 (updated version of
748	      RFC 3667).

750	8.4.  From -02 to -03

752	   o  Replaced most occurrences of 'resource' with 'MIME entity',
753	      because the result of dereferencing a URI is not the resource
754	      itself, but some MIME entity (in our case of type text/plain)
755	      representing it.  Thanks to Sandro Hawke for pointing this out.

757	   o  Moved "Open Issues" to the very back of the document.

759	   o  Added Section 4 to define the processing model for fragment
760	      identifiers (moved Section 4.2 from Section 3 to Section 4).

762	   o  Added hash scheme to make fragment identifiers more robust
763	      (Section 2.5).

765	   o  Changed IPR clause from RFC 2026 to RFC 3667 (updated version of
766	      RFC 2026).

768	8.5.  From -01 to -02

770	   o  Fundamental change in semantics: counts turn into positions
771	      (between characters or lines), so in order to identify a character
772	      or line, ranges must be used (which now use positions to specify
773	      the upper and lower bounds of the range).

775	   o  Made the first value of a range optional as well, so that line=,5
776	      also is legal, identifying everything from the start of the MIME
777	      entity to the 5th line.

779	   o  Changed the syntax from parenthesis-style to a more traditional
780	      style using equals-signs.

782	8.6.  From -00 to -01

784	   o  Made the second count value of ranges optional, so that something
785	      like line(10,) is legal and properly defined.

787	   o  Added non-normative reference to Internet draft about non-ASCII
788	      characters in search strings.

790	   o  Added Section 1.4 about incremental deployment.

792	   o  Added more elaborate examples.

794	   o  Added text about regex buffer overflow problems in Section 7.

796	   o  Added Section 4.1 about line endings in text/plain resources.

798	   o  Added "Open Issues" to collect open issues regarding this memo
799	      (will be deleted in final RFC text).

801	9.  References

803	9.1.  Normative References

805	   [1]   Freed, N. and N. Borenstein, "Multipurpose Internet Mail
806	         Extensions (MIME) Part Two: Media Types", RFC 2046,
807	         November 1996.

809	   [2]   Freed, N. and N. Borenstein, "Multipurpose Internet Mail
810	         Extensions (MIME) Part One: Format of Internet Message Bodies",
811	         RFC 2045, November 1996.

813	   [3]   Gellens, R., "The Text/Plain Format and DelSp Parameters",
814	         RFC 3676, February 2004.

816	   [4]   Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
817	         Resource Identifier (URI): Generic Syntax", RFC 3986,
818	         January 2005.

820	   [5]   Bradner, S., "Key words for use in RFCs to Indicate Requirement
821	         Levels", RFC 2119, March 1997.

823	   [6]   International Organization for Standardization, "Information
824	         technology - Portable Operating System Interface (POSIX) - Part
825	         2: Shell and Utilities", ISO 9945-2, 1993.

827	   [7]   Crocker, D. and P. Overell, "Augmented BNF for Syntax
828	         Specifications: ABNF", RFC 4234, October 2005.

830	   [8]   Yergeau, F., "UTF-8, a transformation format of ISO 10646",
831	         STD 63, RFC 3629, November 2003.

833	   [9]   Duerst, M. and M. Suignard, "Internationalized Resource
834	         Identifiers (IRI)", RFC 3987, January 2005.

836	   [10]  Rivest, R., "The MD5 Message-Digest Algorithm", RFC 1321,
837	         April 1992.

839	9.2.  Non-Normative References

841	   [11]  Connolly, D. and L. Masinter, "The 'text/html' Media Type",
842	         RFC 2854, June 2000.

844	   [12]  Freed, N. and J. Klensin, "Media Type Specifications and
845	         Registration Procedures", RFC 4288, December 2005.

847	   [13]  DeRose, S., Maler, E., and D. Orchard, "XML Linking Language
848	         (XLink) Version 1.0", W3C Recommendation REC-xlink-20010627,
849	         June 2001.

851	   [14]  Hoffman, P. and F. Yergeau, "UTF-16, an encoding of ISO 10646",
852	         RFC 2781, February 2000.

854	Appendix A.  Acknowledgements

856	   Thanks for comments and suggestions provided by Marcel Baschnagel,
857	   John Cowan, Benja Fallenstein, Sandro Hawke, Dan Kohn, Henrik
858	   Levkowetz, and Ted Hardie.

860	Authors' Addresses

862	   Erik Wilde
863	   UC Berkeley
864	   School of Information
865	   Berkeley, CA 94720-4600
866	   U.S.A.

868	   Phone: +1-510-6432253
869	   Email: net.dret@dret.net
870	   URI:   http://dret.net/netdret/

872	   Martin Duerst (Note: Please write "Duerst" with u-umlaut wherever
873	                 possible, for example as "D&#252;rst" in XML and HTML.)
874	   Aoyama Gakuin University
875	   5-10-1 Fuchinobe
876	   Sagamihara, Kanagawa  229-8558
877	   Japan

879	   Phone: +81 42 759 6329
880	   Fax:   +81 42 759 6495
881	   Email: mailto:duerst@it.aoyama.ac.jp
882	   URI:   http://www.sw.it.aoyama.ac.jp/D%C3%BCrst/

884	Full Copyright Statement

886	   Copyright (C) The Internet Society (2007).

888	   This document is subject to the rights, licenses and restrictions
889	   contained in BCP 78, and except as set forth therein, the authors
890	   retain all their rights.

892	   This document and the information contained herein are provided on an
893	   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
894	   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
895	   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
896	   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
897	   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
898	   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

900	Intellectual Property

902	   The IETF takes no position regarding the validity or scope of any
903	   Intellectual Property Rights or other rights that might be claimed to
904	   pertain to the implementation or use of the technology described in
905	   this document or the extent to which any license under such rights
906	   might or might not be available; nor does it represent that it has
907	   made any independent effort to identify any such rights.  Information
908	   on the procedures with respect to rights in RFC documents can be
909	   found in BCP 78 and BCP 79.

911	   Copies of IPR disclosures made to the IETF Secretariat and any
912	   assurances of licenses to be made available, or the result of an
913	   attempt made to obtain a general license or permission for the use of
914	   such proprietary rights by implementers or users of this
915	   specification can be obtained from the IETF on-line IPR repository at
916	   http://www.ietf.org/ipr.

918	   The IETF invites any interested party to bring to its attention any
919	   copyrights, patents or patent applications, or other proprietary
920	   rights that may cover technology that may be required to implement
921	   this standard.  Please address the information to the IETF at
922	   ietf-ipr@ietf.org.

924	Acknowledgment

926	   Funding for the RFC Editor function is provided by the IETF
927	   Administrative Support Activity (IASA).