idnits 2.17.1 

draft-masinter-url-i18n-02.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Looks like you're using RFC 2026 boilerplate.  This must be updated to
     follow RFC 3978/3979, as updated by RFC 4748.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  ** Missing expiration date.  The document expiration date should appear on
     the first and last page.

  ** The document seems to lack a 1id_guidelines paragraph about
     Internet-Drafts being working documents. 

  ** The document seems to lack a 1id_guidelines paragraph about 6 months
     document validity -- however, there's a paragraph with a matching
     beginning. Boilerplate error?

  ** The document seems to lack a 1id_guidelines paragraph about the list of
     current Internet-Drafts. 

  ** The document seems to lack a 1id_guidelines paragraph about the list of
     Shadow Directories. 

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard

  == The page length should not exceed 58 lines per page, but there was 1
     longer page, the longest (page 1) being 420 lines


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an IANA Considerations section.  (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  ** The document seems to lack separate sections for Informative/Normative
     References.  All references will be assumed normative when checking for
     downward references.

  ** There are 3 instances of too long lines in the document, the longest one
     being 6 characters in excess of 72.

  ** There are 8 instances of lines with control characters in the document.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (August 30, 1998) is 9368 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Missing Reference: 'RFC-DUERST' is mentioned on line 147, but not defined

  == Missing Reference: 'XML1' is mentioned on line 282, but not defined

  == Unused Reference: 'RFC 2119' is defined on line 387, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC FTP' is defined on line 409, but no explicit
     reference was found in the text

  == Unused Reference: 'XMl1' is defined on line 415, but no explicit
     reference was found in the text

  ** Obsolete normative reference: RFC 2279 (Obsoleted by RFC 3629)

  ** Obsolete normative reference: RFC 2396 (Obsoleted by RFC 3986)

  -- Possible downref: Non-RFC (?) normative reference: ref. 'UNI15'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'RFC HTTP'

  ** Obsolete normative reference: RFC 2141 (Obsoleted by RFC 8141)

  ** Obsolete normative reference: RFC 2192 (Obsoleted by RFC 5092)

  -- Possible downref: Non-RFC (?) normative reference: ref. 'RFC FTP'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'HTML4'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'XMl1'


     Summary: 14 errors (**), 0 flaws (~~), 7 warnings (==), 7 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

1	INTERNET-DRAFT                                         Larry Masinter
2	                                                    Xerox Corporation
3	                                                        Martin Duerst
4	                                                  W3C/Keio University
5	draft-masinter-url-i18n-02                            August 30, 1998
6	Expires in 6 months

8	   Representing non-ASCII Characters in URIs and Extended URIs

10	Status of this Memo

12	This document is an Internet-Draft.  Internet-Drafts are working
13	documents of the Internet Engineering Task Force (IETF), its areas,
14	and its working groups.  Note that other groups may also distribute
15	working documents as Internet-Drafts.

17	Internet-Drafts are draft documents valid for a maximum of six months
18	and may be updated, replaced, or obsoleted by other documents at any
19	time.  It is inappropriate to use Internet-Drafts as reference
20	material or to cite them other than as ``work in progress.''

22	To view the entire list of current Internet-Drafts, please check
23	the "1id-abstracts.txt" listing contained in the Internet-Drafts
24	Shadow Directories on ftp.is.co.za (Africa), ftp.nordu.net
25	(Northern Europe), ftp.nis.garr.it (Southern Europe), munnari.oz.au
26	(Pacific Rim), ftp.ietf.org (US East Coast), or ftp.isi.edu (US
27	West Coast).

29	This document is not a product of any working group, but may
30	be discussed on the mailing list url-i18n@unicode.org.

32	Abstract

34	URIs are defined as sequences of characters chosen from a limited
35	subset of the repertoire of ASCII characters, both for transmission in
36	network protocols and representation in spoken and written human
37	communication.

39	This document defines a uniform way of representing non-ASCII scripts
40	in URIs and in an extended 8-bit form (8URI), so these identifiers can
41	be used for the world's languages. The document gives guidelines for
42	the use and deployment of these forms in various elements of software
43	that deal with URIs.

45	1. Introduction

47	URIs [RFC 2396] are defined as sequences of characters chosen from a
48	limited subset of the repertoire of ASCII characters.  The characters
49	in URIs are frequently used for representing English words and
50	phrases; unfortunately, this leaves out most of the world, who do not
51	write merely with the letters A-Z.

53	The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
54	"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
55	document are to be interpreted as described in RFC 2119.

57	2. Syntax

59	This document defines two ways of representing non-ASCII characters in
60	resource identifiers: a URI syntax which is compatible with the
61	definition of URI syntax [RFC 2396], and a new syntax which is usable
62	in contexts where resource identifiers are transported within "8-bit"
63	environments. This new syntax is called an "8URI"; it is upward
64	compatible with the URI syntax, but is defined as a sequence of 8-bit
65	octets.

67	2.1 URI syntax

69	The standard definition of URIs [RFC 2396] requires that URIs be
70	represented with a very limited repertoire of characters which are a
71	subset of those characters representable in ASCII. URIs are defined as
72	a sequence of characters (since URIs may be written on paper or read
73	out loud) which may be represented as a sequence of 7-bit bytes.

75	Character sequences that include non-ASCII characters must be
76	transcribed to represent them in URIs. The transcription to be applied
77	to a character sequence before it is included in an element of a URI
78	(path, etc.) SHOULD be performed by:

80	1) representing the characters as a sequence of ISO 10646 characters.
81	2) "normalizing" the character sequence to reduce ambiguity.
82	   [UNI15] defines several normalization forms; for the purpose of
83	   representing characters in URIs, use "Normalization Form CC".
84	3) encoding the result with the UTF-8 character encoding [RFC 2279].
85	4) using %HH hex-encoding [RFC 2396] to encode any octet that
86	   does not correspond to an allowed, non-reserved character.
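
The four steps above can be sketched in Python as follows. This is a
minimal illustration under stated assumptions, not a normative
algorithm: the function name to_uri_component is hypothetical, Python's
"NFC" form stands in for the "Normalization Form CC" of [UNI15], and
the set of characters left unencoded is the alphanum and mark
characters of [RFC 2396].

```python
import unicodedata

# Unreserved characters of RFC 2396 (alphanum plus "mark").
UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789"
    "-_.!~*'()"
)


def to_uri_component(text):
    """Transcribe a character sequence for use in one URI element."""
    # Step 1: a Python str is already a sequence of ISO 10646 characters.
    # Step 2: normalize (assumption: NFC stands in for the draft's form).
    normalized = unicodedata.normalize("NFC", text)
    # Step 3: encode the result as UTF-8.
    octets = normalized.encode("utf-8")
    # Step 4: %HH-encode every octet that is not an unreserved character.
    return "".join(
        chr(b) if chr(b) in UNRESERVED else "%{:02X}".format(b)
        for b in octets
    )
```

With this sketch, to_uri_component("D\u00fcrst") gives "D%C3%BCrst",
the same form that appears in the author's address in section 8, and
the decomposed input "Du\u0308rst" gives the same result.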

88	This syntax is consistent with the definition of the generic URI
89	syntax [RFC 2396], the URN syntax [RFC 2141], as well as recent URL
90	scheme definitions [RFC 2192], [RFC 2384].

92	2.2 8URI syntax

94	This specification defines a new protocol element, called an '8URI'.
95	An 8URI is similar to a URI in its use, but is different in that it is
96	solely for use in network protocols that allow the transport of octets
97	outside of the range allowed within URIs. An 8URI MAY have 8-bit
98	octets within it. An 8URI is represented using the same methods (1-4)
99	defined in section 2.1, but in step (4), octets with the high-order
100	bit set need not be encoded; all characters outside of those explicitly
101	disallowed in RFC 2396 (reserved, delimiters, white space, unwise
102	special characters) MAY be represented directly by their UTF-8
103	encoding.
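
As an illustration, the 8URI variant of the transcription might look
like the following Python sketch. The name to_8uri_component is
hypothetical, "NFC" is again an assumed stand-in for the normalization
form of [UNI15], and the result is returned as octets because an 8URI
is defined as a sequence of 8-bit octets.

```python
import unicodedata

# Unreserved characters of RFC 2396 (alphanum plus "mark").
UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789"
    "-_.!~*'()"
)


def to_8uri_component(text):
    """Transcribe a character sequence for use in one 8URI element."""
    octets = unicodedata.normalize("NFC", text).encode("utf-8")
    out = bytearray()
    for b in octets:
        if b >= 0x80 or chr(b) in UNRESERVED:
            # Octets of UTF-8 multi-octet sequences pass through as-is.
            out.append(b)
        else:
            # Other ASCII octets are still %HH-encoded.
            out += b"%%%02X" % b
    return bytes(out)
```

For "D\u00fcrst" this sketch produces 6 octets, against the 10
characters of the hex-encoded URI form "D%C3%BCrst".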

105	An '8URI' for characters outside of the ASCII range will use
106	considerably less space than the corresponding hex-encoded URI.

108	Even within 8URIs, any octet sequence which would likely yield
109	ambiguous or incorrect results when printed or displayed and then
110	subsequently typed by a user SHOULD be hex-encoded.

112	Internet protocols that currently allow the designation of a URI may
113	be extended at some point to allow 8URIs as well as URIs, but this
114	extension must be done explicitly. Section 4 lays out some of the
115	software guidelines that will allow the deployment of 8URIs in
116	existing Internet Protocols.

118	3. Software Requirements and Upgrade Strategy

120	Supporting URIs for non-ASCII characters requires cooperation from the
121	providers of several different components of URI software: software
122	that allows users to enter URIs, software that generates URIs,
123	software that displays URIs, and software that interprets URIs.

125	3.1 URI entry

127	One component of software that deals with URIs allows users to enter a
128	URI, e.g., by typing or dictation. For example, a person viewing a
129	visual representation of a URI (as a sequence of glyphs, in some
130	order, in some visual display) might use a keyboard entry method for
131	keys in that language to create the URI. For ASCII characters with
132	standard English keyboards, the process is simple, since there is
133	generally a simple correspondence between letters represented, keys
134	pressed, and internal system representation, but for other languages
135	the process is much more complex.

137	If the visual representation contains only those characters that are
138	allowed by the standard syntax of URIs [RFC 2396], the transcription
139	is simple. However, for all other sequences of characters, it is
140	RECOMMENDED that the entry result in characters, in logical order,
141	from the ISO 10646 character repertoire, encoded using the UTF-8
142	method [RFC 2279], and then subsequently encoded as necessary using
143	the URI hex-encoding. The set of octets that require encoding
144	depends on whether the result is a URI or an 8URI.

146	The characters the user has entered should be normalized according to
147	the rules in [RFC-DUERST]; for example, all accented characters should
148	be translated into their combined form, no extraneous BIDI
149	(bidirectional) marks should be left in the resulting stream, and
150	characters that are intended to represent Western European letters
151	should be transcribed into their ISO-8859-1 equivalents and not, for
152	example, as double-wide characters.
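
A rough Python sketch of this kind of entry normalization is shown
below. It does not implement the rules of the cited document; it
assumes that Unicode compatibility composition ("NFKC") approximates
the combining and width-folding behaviour described, and that every
directional mark in the listed set is extraneous. The name
normalize_entry and the set BIDI_MARKS are hypothetical.

```python
import unicodedata

# Assumption: these directional formatting characters are all treated
# as extraneous. LRM, RLM, and the embedding/override controls.
BIDI_MARKS = frozenset(
    "\u200e\u200f\u202a\u202b\u202c\u202d\u202e"
)


def normalize_entry(text):
    """Normalize user-entered characters before URI transcription."""
    # Drop directional marks that carry no identifying information.
    stripped = "".join(ch for ch in text if ch not in BIDI_MARKS)
    # NFKC composes base letters with accents and folds double-wide
    # Latin letters to their ordinary equivalents.
    return unicodedata.normalize("NFKC", stripped)
```

For instance, "e" followed by U+0301 becomes the single character
U+00E9, and the double-wide letters U+FF21 U+FF22 U+FF23 become "ABC".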

154	Whether URI entry should result in a URI or an 8URI will depend on the
155	capability of the protocol or software to which the result will be
156	submitted.

158	3.2 URI generation

160	Systems that are offering resources through the Internet, where those
161	resources have logical names, sometimes offer the ability to generate
162	URIs for the resources they offer.  For example, some HTTP servers
163	offer the ability to generate a 'directory listing' for file
164	directories under their purview, and then to respond to the generated
165	URIs with the files. If the names of the files consist solely of
166	US-ASCII characters the transcription is simple, but other file
167	systems offer a wider variety of characters. Many currently deployed
168	systems do not transform the local character representation
169	of the underlying system before generating URIs.

171	For maximum interoperability, systems that generate resource
172	identifiers SHOULD translate the local encoding to UTF-8 and
173	hex-encode the results as appropriate for the URI or 8URI.
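
A server-side sketch of this translation, in Python, might read as
follows. It assumes the file name is available as raw octets together
with the name of the local encoding; the function name
generate_uri_path_segment is hypothetical. It uses the standard
library function urllib.parse.quote, which encodes a few more
characters than [RFC 2396] strictly requires.

```python
import unicodedata
from urllib.parse import quote


def generate_uri_path_segment(local_name, local_encoding):
    """Turn a locally encoded file name into a URI path segment."""
    # Translate from the local encoding to ISO 10646 characters.
    text = local_name.decode(local_encoding)
    # Assumption: NFC stands in for the normalization form of [UNI15].
    text = unicodedata.normalize("NFC", text)
    # quote() encodes the text as UTF-8 and %HH-encodes the octets;
    # safe="" ensures that "/" inside a name is encoded as well.
    return quote(text, safe="")
```

A name stored as the ISO-8859-1 octets b"D\xfcrst" would then be
offered as "D%C3%BCrst" rather than as "D%FCrst".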

175	Whether the generated identifier should result in a URI or an 8URI
176	depends on the capability of the protocol or software to which the
177	result will be submitted.

179	This recommendation applies to HTTP servers as well as those systems
180	that generate and interpret URLs for FTP, gopher and the like.

182	3.3 Display of URIs

184	Many systems contain software that presents URIs to users as part of
185	their user interface (sometimes presenting 'friendly' URIs). This
186	section applies to this presentation, as well as to the strategy for
187	printing URIs in magazines, newspapers, or reading them over the
188	radio.

190	Software that displays identifiers to users should follow a general
191	principle: "Don't display something to a user that the user would not
192	be able to enter." The consequences of this principle require
193	judgement about the availability of software that implements the entry
194	methods described in section 3.1.

196	a) In situations where a viewer is not likely to have software that
197	implements non-ASCII character entry as described in section 3.1, any
198	octet not representable by a character allowed in [RFC 2396]
199	SHOULD be displayed as if it were hex-encoded.

201	b) In situations where a viewer _is_ likely to have such software,
202	sequences of octets MAY be displayed directly as the non-ASCII
203	character sequences they represent in UTF-8. Character sequences of
204	%HH-encoding which correspond to non-ASCII characters MAY be displayed
205	directly without decoding OR may be displayed as if they were
206	sequences of hex-encoded UTF-8.
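
One possible reading of cases (a) and (b) is sketched below in Python.
It assumes the input is already in the hex-encoded URI form of
section 2.1, and that the caller knows whether the viewer can enter
non-ASCII characters; the name display_form and its parameter are
hypothetical. Only runs of %HH that decode as valid UTF-8 non-ASCII
characters are shown as characters, so encoded ASCII such as "%2F" is
left untouched.

```python
import re

# A run of %HH escapes whose octets all have the high-order bit set.
_HIGH_RUN = re.compile(r"(?:%[89A-Fa-f][0-9A-Fa-f])+")


def display_form(uri, viewer_can_enter_non_ascii):
    """Choose the form of a hex-encoded URI to present to a viewer."""
    if not viewer_can_enter_non_ascii:
        # Case (a): keep the hex-encoded form.
        return uri

    def decode_run(match):
        octets = bytes.fromhex(match.group(0).replace("%", ""))
        try:
            # Case (b): show the characters the octets represent.
            return octets.decode("utf-8")
        except UnicodeDecodeError:
            # Not valid UTF-8: leave the escapes as they are.
            return match.group(0)

    return _HIGH_RUN.sub(decode_run, uri)
```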

208	3.4 Interpretation of URIs

210	Software that interprets URIs as the names of local resources SHOULD
211	accept multiple renditions of the URIs in the case where those
212	resource names might have non-ASCII representations; this includes
213	accepting both the URI syntax of section 2.1 and the 8URI form in
214	section 2.2.

216	Just as allowing case-insensitive file names makes URIs more robust
217	(because the person viewing the URI might type the case differently
218	than it is displayed), similarly, URI-interpreting software should be
219	generous in allowing all of the possible representations that might
220	result from the recommendations in section 3.1. In addition, it is
221	useful if unaccented characters are accepted, when possible, as
222	aliases for accented characters, and if other equivalences are made.

224	For example, a URI which contains a string in Japanese might actually
225	arrive with a variety of encodings, due to the variety of
226	interpretations of deployed systems. While this recommendation
227	specifies a canonical encoding of Japanese using %HH-encoded UTF-8, in
228	practice many URIs will be presented which contain characters encoded
229	using Shift-JIS or EUC-JP, either with %HH encoding or not. Thus, to
230	transition to the new regime, URI-interpreting software for Japanese
231	should accept all three of the EUC-JP, Shift-JIS and UTF-8 encodings.
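
A generous interpreter for the Japanese case could be sketched in
Python as below. The approach is an assumption rather than something
this document prescribes: it decodes the %HH escapes to octets, tries
each candidate encoding in turn, and returns every distinct name that
results, leaving it to the caller to match those names against the
resources that actually exist. The names CANDIDATE_ENCODINGS and
candidate_names are hypothetical.

```python
from urllib.parse import unquote_to_bytes

# Assumed order of preference: the canonical encoding first.
CANDIDATE_ENCODINGS = ("utf-8", "euc_jp", "shift_jis")


def candidate_names(request_path):
    """Return the resource names a request path might denote."""
    raw = unquote_to_bytes(request_path)
    names = []
    for encoding in CANDIDATE_ENCODINGS:
        try:
            name = raw.decode(encoding)
        except UnicodeDecodeError:
            # The octets are not valid in this encoding; skip it.
            continue
        if name not in names:
            names.append(name)
    return names
```

Because one octet sequence may be valid in more than one encoding, the
result can contain several names; a server would serve a resource only
when one of them matches an existing name.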

233	4. Upgrading

235	As this recommendation places further constraints on software for
236	which many instances are already deployed, it is important to
237	introduce upgrades carefully.

239	4.1 Upgrade sequence

241	The deployment strategy (for both hex-encoded and 8URIs) is in the
242	following sequence:

244	  Interpret  -->   Generation
245	              |
246	              +->  Entry   --> Display

248	Initially, it is most important to upgrade the URI interpreting
249	software according to the recommendations of section 3.4.

251	The upgrade of generating software to use UTF-8 (instead of a local
252	encoding) should happen only after the service is upgraded to accept
253	such URIs. Similarly, 8URIs should only be generated when the service
254	accepts 8URIs and the intervening infrastructure and protocol are known
255	to transport them safely.

257	Similarly, once interpreting software has been modified to accept
258	alternative encodings, then the entry software can also transition.

260	Display software should be upgraded only after upgraded entry software
261	has been widely deployed to the population that will see the displayed
262	result.

264	These recommendations, when taken together, will allow for the
265	extension of URIs to handle scripts other than ASCII while minimizing
266	interoperability problems.

268	4.2 Examples: upgrading URIs within various contexts

270	4.2.1 URIs within HTTP

272	The HTTP protocol [RFC HTTP] includes the URI of the resource being
273	accessed as the 'Request-URI' in the request line. Most deployed HTTP
274	servers that access resources with localized non-ASCII naming do not
275	translate the Request-URI's character encoding to a local form, and
276	will need to be upgraded to accept such aliases.  Most deployed HTTP
277	servers do not restrict the octets allowed in the protocol, and
278	so an upgrade from URI to 8URI will not be difficult.

280	4.2.2 URIs within HTML and XML

282	Within an HTML [HTML4] or XML [XML1] document the primary difficulty
283	for the use of 8URIs is that the document itself may be represented
284	and labelled with a charset other than UTF-8. In such situations, the
285	document as a whole might be transcoded into another
286	encoding. However, the hex-encoded URIs following the recommendations
287	of this document should pass from the recipient of the document back
288	into the URI interpreting infrastructure without change.

290	4.2.3 URIs within email and text/plain

292	E-mail messages are frequently transmitted as text/plain; the use of
293	octets outside of US-ASCII requires an encoding of the message using
294	quoted-printable or base64. In addition, text messages that arrive
295	with charset=utf-8 may be transcoded into a local character
296	representation before storage or display. Thus, URIs within email
297	messages should likely remain within the limited repertoire rather
298	than the 8URI representation.

300	However, it is now common for email software to recognize embedded
301	URIs within email messages and present them specially, e.g., as
302	hypertext links. Within such systems, it is reasonable to upgrade
303	the email display software to present URIs as the natural characters
304	they represent, as long as the entry software in the same system
305	has been upgraded.

307	5. Security Considerations

309	If URI entry software is upgraded to normalize the characters entered,
310	but the URI interpreting software has not been upgraded to treat
311	multiple forms as equivalent, this introduces the possibility of
312	"spoofing": having different resources whose URIs look the same but
313	are not the same. For example, if "abc" and "def" are different
314	encodings of the same visual characters, "http://a.com/abc" and
315	"http://a.com/def" might look the same to users, might display the
316	same, and different URI entry software components might generate
317	different ones; e.g., EUC-JP-based Japanese URI entry software might
318	generate one encoding, while UTF-8-based software would generate
319	another one. In this case, if "a.com" allows multiple users to
320	establish different areas, it might be possible for someone other than
321	the owner of "http://a.com/abc" to put different content at
322	"http://a.com/def" and "spoof" the results.
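
The divergence described above can be made concrete with a short
Python sketch. It takes one visible string and produces the
hex-encoded forms that differently configured entry software might
generate; the function name uri_forms and the choice of the three
variants are illustrative assumptions.

```python
import unicodedata
from urllib.parse import quote


def uri_forms(text):
    """Return hex-encoded forms that may all display as `text`."""
    return {
        # UTF-8 after composing base characters with their marks.
        "utf-8, composed": quote(unicodedata.normalize("NFC", text)),
        # UTF-8 without composition: the same glyphs, other octets.
        "utf-8, decomposed": quote(unicodedata.normalize("NFD", text)),
        # A legacy local encoding, hex-encoded octet by octet.
        "euc_jp": quote(text.encode("euc_jp")),
    }
```

For the single hiragana character U+304C, the three forms are all
different strings, so a server that does not treat them as equivalent
could hold three distinct resources whose URIs look identical to
users.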

324	Conceptually, this is no different from the problems surrounding the
325	use of case-insensitive web servers.  For example, a popular web page
326	with a mixed case name (http://big.site/PopularPage) might be
327	"spoofed" by someone who obtains access to
328	(http://big.site/popularpage).  However, the introduction of the
329	Unicode canonicalization rules in conjunction with mapping from
330	multiple possible native encodings might result in aliasing which is
331	difficult to determine in advance. Administrators of large sites which
332	allow independent users to create subareas may need to be careful that
333	the aliasing rules do not create such conflicts.

335	6. Acknowledgements

337	Thanks to Francois Yergeau, Chris Wendt, Yaron Goland, Graham Klyne,
338	Roy Fielding and many others for help with this document.

340	7. Copyright

342	Copyright (C) The Internet Society, 1997. All Rights Reserved.

344	This document and translations of it may be copied and furnished to
345	others, and derivative works that comment on or otherwise explain it
346	or assist in its implementation may be prepared, copied, published and
347	distributed, in whole or in part, without restriction of any kind,
348	provided that the above copyright notice and this paragraph are
349	included on all such copies and derivative works.  However, this
350	document itself may not be modified in any way, such as by removing
351	the copyright notice or references to the Internet Society or other
352	Internet organizations, except as needed for the purpose of developing
353	Internet standards in which case the procedures for copyrights defined
354	in the Internet Standards process must be followed, or as required to
355	translate it into languages other than English.

357	The limited permissions granted above are perpetual and will not be
358	revoked by the Internet Society or its successors or assigns.

360	This document and the information contained herein is provided on an
361	"AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
362	TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT
363	NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN
364	WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
365	MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

367	8. Author's address

369		Larry Masinter
370		Xerox Corporation
371		3333 Coyote Hill Road
372		Palo Alto, CA 94304
373		masinter@parc.xerox.com
374		http://www.parc.xerox.com/masinter
375		Fax: +1 650 812-4333

377		Martin J. Duerst
378	        W3C/Keio University
379	        5322 Endo, Fujisawa
380	        252-8520 Japan
381	        duerst@w3.org
382	        http://www.w3.org/People/D%C3%BCrst/
383	        Tel/Fax: +81 466 49 1170

385	9. References

387	[RFC 2119]    S. Bradner, "Key words for use in RFCs to Indicate
388	              Requirement Levels", March 1997.

390	[RFC 2279]    F. Yergeau. "UTF-8, a transformation format of ISO 10646."
391	              January 1998.

393	[RFC 2396]    T.Berners-Lee, R.Fielding, L.Masinter. "Uniform
394	              Resource Identifiers (URI): Generic Syntax." August,
395	              1998.

397	[UNI15]       M.Davis, "Unicode Normalization Forms", Draft Unicode
398	              Technical Report #15, August 1998.

400	[RFC HTTP]    R.Fielding, J.Gettys, et al, "Hypertext Transfer Protocol --
401	              HTTP/1.1", <draft-ietf-http-v11-spec-rev-04.txt>.

403	[RFC 2141]    R. Moats, "URN Syntax", May 1997.

405	[RFC 2192]    C. Newman, "IMAP URL Scheme", September 1997.

407	[RFC 2384]    R. Gellens, "POP URL Scheme", August 1998.

409	[RFC FTP]     B. Curtis, "Internationalization of the File Transfer Protocol",
410	              <draft-ietf-ftpext-intl-ftp-05.txt>.

412	[HTML4]       "HTML 4.0", World Wide Web Consortium,
413	               <http://www.w3.org/TR/REC-html40/appendix/notes.html#h-B.2>.

415	[XML1]        "XML 1.0", World Wide Web Consortium Recommendation,
416	              <http://www.w3.org/TR/REC-xml#sec-external-ent>.