INTERNET-DRAFT                                            Larry Masinter
                                                       Xerox Corporation
                                                           Martin Duerst
                                                     W3C/Keio University
draft-masinter-url-i18n-02                               August 30, 1998
Expires in 6 months

Representing non-ASCII Characters in URIs and Extended URIs

Status of this Memo

This document is an Internet-Draft. Internet-Drafts are working
documents of the Internet Engineering Task Force (IETF), its areas,
and its working groups. Note that other groups may also distribute
working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as ``work in progress.''

To view the entire list of current Internet-Drafts, please check
the "1id-abstracts.txt" listing contained in the Internet-Drafts
Shadow Directories on ftp.is.co.za (Africa), ftp.nordu.net
(Northern Europe), ftp.nis.garr.it (Southern Europe), munnari.oz.au
(Pacific Rim), ftp.ietf.org (US East Coast), or ftp.isi.edu (US
West Coast).

This document is not a product of any working group, but may
be discussed on the mailing list url-i18n@unicode.org.

Abstract

URIs are defined as sequences of characters chosen from a limited
subset of the repertoire of ASCII characters, both for transmission in
network protocols and representation in spoken and written human
communication.

This document defines a uniform way of representing non-ASCII scripts
in URIs and in an extended 8-bit form (8URI), so these identifiers can
be used for the world's languages.
The document gives guidelines for the use and deployment of these forms
in various elements of software that deal with URIs.

1. Introduction

URIs [RFC 2396] are defined as sequences of characters chosen from a
limited subset of the repertoire of ASCII characters. The characters
in URIs are frequently used for representing English words and
phrases; unfortunately, this leaves out most of the world, who do not
write merely with the letters A-Z.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in [RFC 2119].

2. Syntax

This document defines two ways of representing non-ASCII characters in
resource identifiers: a URI syntax which is compatible with the
definition of URI syntax [RFC 2396], and a new syntax which is usable
in contexts where resource identifiers are transported within "8-bit"
environments. This new syntax is called an "8URI"; it is upward
compatible with the URI syntax, but is defined as a sequence of 8-bit
octets.

2.1 URI syntax

The standard definition of URIs [RFC 2396] requires that URIs be
represented with a very limited repertoire of characters which are a
subset of those characters representable in ASCII. URIs are defined as
a sequence of characters (since URIs may be written on paper or read
out loud) which may be represented as a sequence of 7-bit bytes.

Character sequences that include non-ASCII characters must be
transcribed to represent them in URIs. The transcription to be applied
to a character sequence before it is included in an element of a URI
(path, etc.) SHOULD be performed by:

1) representing the characters as a sequence of ISO 10646 characters.
2) "normalizing" the character sequence to reduce ambiguity.
   [UNI15] defines several normalization forms; for the purpose
   of representing characters in URIs, "Normalization Form CC"
   is used.
3) encoding the result with the UTF-8 character encoding [RFC 2279].
4) using %HH hex-encoding [RFC 2396] to encode any octet that
   does not correspond to an allowed, non-reserved character.

This syntax is consistent with the definition of the generic URI
syntax [RFC 2396], the URN syntax [RFC 2141], as well as recent URL
scheme definitions [RFC 2192], [RFC 2384].

2.2 8URI syntax

This specification defines a new protocol element, called an '8URI'.
An 8URI is similar to a URI in its use, but is different in that it is
solely for use in network protocols that allow the transport of octets
outside of the range allowed within URIs. An 8URI MAY have 8-bit
octets within it. An 8URI is represented using the same methods (1-4)
defined in section 2.1, but in step (4), octets with the leading bit
on need not be encoded; all characters outside of those explicitly
disallowed in RFC 2396 (reserved, delimiters, white space, unwise
special characters) MAY be represented directly by their UTF-8
encoding.

An '8URI' for characters outside of the ASCII range will use
considerably less space than the corresponding hex-encoded URI.

Even within 8URIs, any octet sequence which would likely yield
ambiguous or incorrect results when printed or displayed and then
subsequently typed by a user SHOULD be hex-encoded.

Internet protocols that currently allow the designation of a URI may
be extended at some point to allow 8URIs as well as URIs, but this
extension must be done explicitly. Section 4 lays out some of the
software guidelines that will allow the deployment of 8URIs in
existing Internet Protocols.

3. Software Requirements and Upgrade Strategy

Supporting URIs for non-ASCII characters requires cooperation from the
providers of several different components of URI software: software
that allows users to enter URIs, software that generates URIs,
software that displays URIs, and software that interprets URIs.

3.1 URI entry

One component of software that deals with URIs allows users to enter a
URI, e.g., by typing or dictation. For example, a person viewing a
visual representation of a URI (as a sequence of glyphs, in some
order, in some visual display) might use a keyboard entry method for
keys in that language to create the URI. For ASCII characters with
standard English keyboards, the process is simple, since there is
generally a simple correspondence between letters represented, keys
pressed, and internal system representation, but for other languages
the process is much more complex.

If the visual representation contains only those characters that are
allowed in the standard syntax of URIs [RFC 2396], the transcription
is simple. However, for all other sequences of characters, it is
RECOMMENDED that the entry result in characters from the ISO 10646
character repertoire, in logical order, encoded using the UTF-8
method [RFC 2279], and then subsequently encoded as necessary using
the URI hex-encoding. The set of octets that require encoding
depends on whether the result is a URI or an 8URI.

The characters the user has entered should be normalized according to
the rules in [RFC-DUERST]; for example, all accented characters should
be translated into their combined form, no extraneous BIDI
(bidirectional) marks should be left in the resulting stream, and
characters that are intended to represent Western European letters
should be transcribed into their ISO-8859-1 equivalents and not, for
example, as double-wide characters.
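
The entry procedure described above (normalize, encode as UTF-8, then
hex-encode as needed) can be sketched in Python as follows. This is
only an illustrative sketch under stated assumptions, not part of the
specification: the names to_uri_component and to_8uri_component are
hypothetical, Unicode NFC as provided by Python's unicodedata module is
assumed to stand in for the normalization form named in section 2.1,
and only the unreserved characters of [RFC 2396] are left unencoded,
which is a conservative choice suited to a single path segment.

```python
import unicodedata

# Unreserved characters of RFC 2396: alphanumerics plus the "mark" set.
# Iterating over bytes yields integers, so this is a set of octet values.
UNRESERVED = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"abcdefghijklmnopqrstuvwxyz"
    b"0123456789"
    b"-_.!~*'()"
)


def _utf8_octets(text: str) -> bytes:
    """Normalize the text and encode it as UTF-8.

    Assumption: NFC is used here in place of the draft's
    "Normalization Form CC".
    """
    return unicodedata.normalize("NFC", text).encode("utf-8")


def to_uri_component(text: str) -> str:
    """Hypothetical helper: URI form, every non-unreserved octet as %HH."""
    return "".join(
        chr(octet) if octet in UNRESERVED else "%{:02X}".format(octet)
        for octet in _utf8_octets(text)
    )


def to_8uri_component(text: str) -> bytes:
    """Hypothetical helper: 8URI form, octets with the high bit set
    are passed through; other non-unreserved octets become %HH."""
    out = bytearray()
    for octet in _utf8_octets(text):
        if octet in UNRESERVED or octet >= 0x80:
            out.append(octet)
        else:
            out += "%{:02X}".format(octet).encode("ascii")
    return bytes(out)
```

With these definitions, to_uri_component("D\u00fcrst") gives
"D%C3%BCrst", and the decomposed input "Du\u0308rst" gives the same
result because of the normalization step. to_8uri_component applied to
the same name leaves the two UTF-8 octets C3 BC unencoded.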

Whether URI entry should result in a URI or an 8URI will depend on the
capability of the protocol or software to which the result will be
submitted.

3.2 URI generation

Systems that are offering resources through the Internet, where those
resources have logical names, sometimes offer the ability to generate
URIs for the resources they offer. For example, some HTTP servers
offer the ability to generate a 'directory listing' for file
directories under their purview, and then to respond to the generated
URIs with the files. If the names of the files consist solely of
US-ASCII characters the transcription is simple, but other file
systems offer a wider variety of characters. Many currently deployed
systems do not transform the local character representation of the
underlying system before generating URIs.

For maximum interoperability, systems that generate resource
identifiers SHOULD translate the local encoding to UTF-8 and
hex-encode the results as appropriate for the URI or 8URI.

Whether the generated identifier should result in a URI or an 8URI
depends on the capability of the protocol or software to which the
result will be submitted.

This recommendation applies to HTTP servers as well as those systems
that generate and interpret URLs for FTP, gopher and the like.

3.3 Display of URIs

Many systems contain software that presents URIs to users as part of
their user interface (sometimes presenting 'friendly' URIs). This
section applies to this presentation, as well as to the strategy for
printing URIs in magazines, newspapers, or reading them over the
radio.

Software that displays identifiers to users should follow a general
principle: "Don't display something to a user that the user would not
be able to enter."
The consequences of this principle require judgement about the
availability of software that implements the entry methods described
in section 3.1.

a) In situations where a viewer is not likely to have software that
   implements non-ASCII character entry as described in section 3.1,
   any octet not representable by a character allowed in [RFC 2396]
   SHOULD be displayed as if it were hex-encoded.

b) In situations where a viewer _is_ likely to have such software,
   sequences of octets MAY be displayed directly as the non-ASCII
   character sequence they represent in UTF-8. Character sequences of
   %HH-encoding which correspond to non-ASCII characters MAY be
   displayed directly without decoding OR may be displayed as if they
   were a sequence of hex-encoded UTF-8.

3.4 Interpretation of URIs

Software that interprets URIs as the names of local resources SHOULD
accept multiple renditions of the URIs in the case where those
resource names might have non-ASCII representations; this includes
accepting both the URI syntax of section 2.1 and the 8URI form in
section 2.2.

Just as allowing case-insensitive file names makes URIs more robust
(because the person viewing the URI might type the case differently
than it is displayed), similarly, URI-interpreting software should be
generous in allowing all of the possible representations that might
result from the recommendations in section 3.1. In addition, it is
useful if unaccented characters are accepted, when possible, as
aliases for accented characters, and if other equivalences are made.

For example, a URI which contains a string in Japanese might actually
arrive with a variety of encodings, due to the variety of
interpretations of deployed systems.
While this recommendation specifies a canonical encoding of Japanese
using %HH-encoded UTF-8, in practice many URIs will be presented which
contain characters encoded using Shift-JIS or EUC-JP, either with %HH
encoding or not. Thus, to transition to the new regime,
URI-interpreting software for Japanese should accept all three of the
EUC-JP, Shift-JIS and UTF-8 encodings.

4. Upgrading

As this recommendation places further constraints on software for
which many instances are already deployed, it is important to
introduce upgrades carefully.

4.1 Upgrade sequence

The deployment strategy (for both hex-encoded URIs and 8URIs) proceeds
in the following sequence:

   Interpret --> Generation
       |
       +-> Entry --> Display

Initially, it is most important to upgrade the URI interpreting
software according to the recommendations of section 3.4.

The upgrade of generating software to use UTF-8 (instead of a local
encoding) should happen only after the service is upgraded to accept
such URIs. Similarly, 8URIs should only be generated when the service
accepts 8URIs and the intervening infrastructure and protocol is known
to transport them safely.

Similarly, once interpreting software has been modified to accept
alternative encodings, then the entry software can also transition.

Display software should be upgraded only after upgraded entry software
has been widely deployed to the population that will see the displayed
result.

These recommendations, when taken together, will allow for the
extension of URIs to handle scripts other than ASCII while minimizing
interoperability problems.

4.2 Examples: upgrading URIs within various contexts

4.2.1 URIs within HTTP

The HTTP protocol [RFC HTTP] includes the URI of the resource being
accessed as the 'Request-URI' in the request line.
Most deployed HTTP servers that access resources with localized
non-ASCII naming do not translate the Request-URI's character encoding
to a local form, and will need to be upgraded to accept such aliases.
Most deployed HTTP servers do not restrict the octets allowed in the
protocol, and so an upgrade from URI to 8URI will not be difficult.

4.2.2 URIs within HTML and XML

Within an HTML [HTML4] or XML [XML1] document the primary difficulty
for the use of 8URIs is that the document itself may be represented
and labelled with a charset other than UTF-8. In such situations, the
document as a whole might be transcoded into another
encoding. However, the hex-encoded URIs following the recommendations
of this document should pass from the recipient of the document back
into the URI interpreting infrastructure without change.

4.2.3 URIs within email and text/plain

E-mail messages are frequently transmitted as text/plain; the use of
octets outside of US-ASCII requires an encoding of the message using
quoted-printable or base64. In addition, text messages that arrive
with charset=utf-8 may be transcoded into a local character
representation before storage or display. Thus, URIs within email
messages should likely remain within the limited repertoire rather
than the 8URI representation.

However, it is now common for email software to recognize embedded
URIs within email messages and present them specially, e.g., as
hypertext links. Within such systems, it is reasonable to upgrade
the email display software to present URIs as the natural characters
they represent, as long as the entry software in the same system
has been upgraded.

5. Security Considerations

If URI entry software is upgraded to normalize the characters entered,
but the URI interpreting software has not been upgraded to treat
multiple forms as equivalent, this introduces the possibility of
"spoofing": having different resources whose URIs look the same but
are not the same. For example, if "abc" and "def" are different
encodings of the same visual characters, "http://a.com/abc" and
"http://a.com/def" might look the same to users, might display the
same, and different URI entry software components might generate
different ones; e.g., EUC-JP-based Japanese URI entry software might
generate one encoding, while UTF-8-based software would generate
another one. In this case, if "a.com" allows multiple users to
establish different areas, it might be possible for someone other than
the owner of "http://a.com/abc" to put different content at
"http://a.com/def" and "spoof" the results.

Conceptually, this is no different from the problems surrounding the
use of case-insensitive web servers. For example, a popular web page
with a mixed case name (http://big.site/PopularPage) might be
"spoofed" by someone who obtains access to
(http://big.site/popularpage). However, the introduction of the
Unicode canonicalization rules in conjunction with mapping from
multiple possible native encodings might result in aliasing which is
difficult to determine in advance. Administrators of large sites which
allow independent users to create subareas may need to be careful that
the aliasing rules do not create such conflicts.

6. Acknowledgements

Thanks to Francois Yergeau, Chris Wendt, Yaron Goland, Graham Klyne,
Roy Fielding and many others for help with this document.

7. Copyright

Copyright (C) The Internet Society, 1997. All Rights Reserved.

This document and translations of it may be copied and furnished to
others, and derivative works that comment on or otherwise explain it
or assist in its implementation may be prepared, copied, published and
distributed, in whole or in part, without restriction of any kind,
provided that the above copyright notice and this paragraph are
included on all such copies and derivative works. However, this
document itself may not be modified in any way, such as by removing
the copyright notice or references to the Internet Society or other
Internet organizations, except as needed for the purpose of developing
Internet standards in which case the procedures for copyrights defined
in the Internet Standards process must be followed, or as required to
translate it into languages other than English.

The limited permissions granted above are perpetual and will not be
revoked by the Internet Society or its successors or assigns.

This document and the information contained herein is provided on an
"AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT
NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN
WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE."

8. Author's address

Larry Masinter
Xerox Corporation
3333 Coyote Hill Road
Palo Alto, CA 94304
masinter@parc.xerox.com
http://www.parc.xerox.com/masinter
Fax: +1 650 812-4333

Martin J. Duerst
W3C/Keio University
5322 Endo, Fujisawa
252-8520 Japan
duerst@w3.org
http://www.w3.org/People/D%C3%BCrst/
Tel/Fax: +81 466 49 1170

9. References

[RFC 2119] S. Bradner, "Key words for use in RFCs to Indicate
           Requirement Levels", March 1997.

[RFC 2279] F. Yergeau. "UTF-8, a transformation format of ISO 10646."
           January 1998.

[RFC 2396] T. Berners-Lee, R. Fielding, L. Masinter. "Uniform
           Resource Identifiers (URI): Generic Syntax." August,
           1998.

[UNI15]    M. Davis, "Unicode Normalization Forms", Draft Unicode
           Technical Report #15, August 1998.

[RFC HTTP] R. Fielding, J. Gettys, et al, "Hypertext Transfer
           Protocol -- HTTP/1.1", .

[RFC 2141] R. Moats, "URN Syntax", May 1997.

[RFC 2192] C. Newman, "IMAP URL Scheme", September 1997.

[RFC 2384] R. Gellens, "POP URL Scheme", August 1998.

[RFC FTP]  B. Curtis, "Internationalization of the File Transfer
           Protocol", .

[HTML4]    "HTML 4.0", World Wide Web Consortium, .

[XML1]     "XML 1.0", World Wide Web Consortium Recommendation, .