INTERNET-DRAFT                                             Martin Duerst
                                                     W3C/Keio University
draft-duerst-iri-bidi-00.txt
Expires January 2002                                       July 13, 2001

               Internet Identifiers and Bidirectionality

Status of this Memo

This document is an Internet-Draft and is in full conformance with all
provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task
Force (IETF), its areas, and its working groups. Note that other
groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt.

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.

This document is not a product of any working group, but should be
discussed on the mailing list . Comments of editorial
nature should be sent directly to the author. For more information
on the topic of this Internet-Draft, please also see [W3C IRI].

Abstract

This memo describes how to deal with Internet identifiers containing
characters from scripts such as Arabic and Hebrew, which use right-to-
left or bidirectional writing. The solution proposed addresses three
different contexts: the purely graphical representation of such
identifiers (e.g. on paper), the embedding of such identifiers into
running text with established rules for bidirectionality, and the
processing and resolution of such identifiers.

0. Change history

Version 00:

This memo has been separated out from [IRI], Section 3.2 to allow
more in-depth and focused discussion of the specific problems of
bidirectionality.

1. Introduction

There is an increased tendency to allow identifiers to use a wide
range of characters from the scripts of the world. The Universal
Character Set (UCS, see [Unicode] and [ISO10646]) makes it easy to
use and exchange such identifiers digitally.
With the appropriate
care (similar to the care needed to avoid confusion between '1', 'l',
and 'I' in US-ASCII-based identifiers), such identifiers can also
be exchanged non-digitally, e.g. written down visually on a medium
such as paper. Potential examples of such identifiers include
Internationalized Resource Identifiers [IRI], Internationalized
Domain Names [IDN], and internationalized email addresses.

Some characters, in particular those of the Arabic and the Hebrew
script, are written from right to left. Together with characters
written from left to right, or with digits that are written from
left to right even in these scripts, this gives rise to a
mixture of different writing directions, a phenomenon called
bidirectionality. Dealing with bidirectionality is indispensable
for the proper treatment of text written with the Arabic or Hebrew
script. But it is highly complex because user expectations may
depend on context and are often difficult to identify and express.

This memo deals with the specific problems of Internet identifiers
containing right-to-left characters, hereafter called bidirectional
identifiers.

The basic paradigm of all modern bidirectional text handling solutions
is the distinction between digital backing store, where text is stored
in logical order, and rendering (display or printing), for which the
necessary reordering is applied according to well-defined rules.
'Logical order' in this context is the order in which the characters
in the text are pronounced or spelled out.

Using logical order in the digital backing store simplifies a large
number of operations, including sorting, searching, text-to-speech
conversion, various other kinds of linguistic processing, input from
keyboards and other devices, and rendering-related operations such as
line breaking and text reflow.
The alternative is to use display order
even in the backing store, but this makes some of the operations above
much more complex and others impossible.

For general text (e.g. ordinary prose), the Unicode bidirectional
algorithm [UnicodeBidi] is the single widely accepted and used reference
for providing this reordering from logical order to rendering. The
Unicode bidirectional algorithm consists of an implicit part (producing
adequate results in most cases) and explicit formatting characters for
advanced cases.

The Unicode bidirectional algorithm also allows higher-order protocols
to override certain aspects of the algorithm. A case where this has
been done is the 'dir' attribute in [HTML4].

Bidirectional Internet identifiers are primarily used in three
different contexts:

1) In visual form: This includes display on CRTs and LCDs as well as
   more permanent visual forms such as printing. At least as far
   as the reading of individual components is concerned, the visual
   form has to use the inherent directionality of the characters used.
   Otherwise, identification, reading, transcription, and so on are
   severely affected.

2) In digital form inside running text (e.g. an IRI or an email
   address in an email or on a web page). It is not always easy
   or possible to distinguish identifiers from other text.

3) In digital form on its own (e.g. in a structured format or
   database of identifiers, or when transmitted for resolution).
   It should be possible to process bidirectional identifiers
   in the same way as other Internet identifiers.

This memo addresses all three of these cases as well as the conversion
between them. The specifics of bidirectional text and of identifier
structure make it impossible to design a solution that works without
additional effort (when compared to non-bidirectional identifiers).
However, the solution proposed in this memo is designed to make the
best of these severe constraints.

2. Notational Conventions

Keywords in all upper-case such as MUST and SHOULD are defined
in [RFC 2119]. In examples, lower-case letters stand for
letters that flow from left to right. Upper-case letters stand for
letters that flow from right to left. A left-to-right example
would be 'hello', whereas a right-to-left example would be
'OLLEH'.

For bidirectional formatting characters from [Unicode], the [XML]-style
entity notation is used, as follows:

   &lrm;   U+200E   LEFT-TO-RIGHT MARK
   &rlm;   U+200F   RIGHT-TO-LEFT MARK
   &lre;   U+202A   LEFT-TO-RIGHT EMBEDDING
   &rle;   U+202B   RIGHT-TO-LEFT EMBEDDING
   &pdf;   U+202C   POP DIRECTIONAL FORMATTING
   &lro;   U+202D   LEFT-TO-RIGHT OVERRIDE
   &rlo;   U+202E   RIGHT-TO-LEFT OVERRIDE

Only the first two are defined in [HTML4]; the others are
replaced by the 'dir' attribute and the element.

3. Identifier Structure

Most Internet identifiers have an inherent structure that distinguishes
structural characters (usually punctuation such as '@', '.', ':', '/',
and so on) and payload components (usually formed with plain alphabetic
or alphanumeric characters).

In order to be able to process bidirectional identifiers in the same
way as other identifiers, it is crucial that in the digital
representations, the individual structural characters and identifier
components are stored in the same sequence as for other identifiers.

The main problem to solve for the visual representation of bidirectional
identifiers is whether the general sequence of components and syntax
characters should be from left to right or from right to left, i.e.
whether the right-to-left equivalent of "ftp.example.com" should be
"MOC.ELPMAXE.PTF" or "PTF.ELPMAXE.MOC". The former may be
seen as more natural in a purely right-to-left context.
But there is also
the possibility of mixed identifiers such as "PTF.ELPMAXE.com".
These provide a very strong motivation for maintaining the same
left-to-right overall component sequence for all Internet identifiers.

The Unicode bidirectional algorithm, extremely simplified, tries to
reorder contiguous sequences of right-to-left characters between
contiguous sequences of left-to-right characters. A third category,
called neutrals, is processed in the same way as surrounding characters.
The main problem for identifiers is that all the structural characters
are treated as neutrals by the Unicode algorithm, which means that they
are moved around together with their context. As an example, the
logical sequence FTP.EXAMPLE.com (corresponding to the example above),
without additional care, is displayed as ELPMAXE.PTF.com, which is
obviously highly confusing.

4. Bidirectional Identifiers in Context

4.1 Independently Processed Bidirectional Identifiers

Bidirectional identifiers processed independently, i.e. stored or
transmitted for resolution, MUST be in full logical order, both
for the overall structure and for the individual components.
They MUST conform directly to the relevant syntax rules.

4.2 Visual Rendering of Bidirectional Identifiers

Bidirectional identifiers MUST be rendered visually by arranging
the components and structural characters in sequence from left to
right. Each component MUST be rendered according to its natural
direction (i.e. left-to-right for components with left-to-right
characters, right-to-left for components with right-to-left
characters).

4.3 Bidirectional Identifiers in Textual Context

In textual context, i.e. assuming rendering by the Unicode bidirectional
algorithm, the backing store representation prescribed in Section 4.1
and the visual rendering prescribed in Section 4.2 have to be
combined.
This is done as follows:

- Each component with right-to-left characters is preceded and
  followed by an &lrm;. This left-to-right mark provides a
  left-to-right context to intervening syntactic characters.

- If the overall context (base directionality) is right-to-left,
  the identifier is preceded by an &lre; and followed by a &pdf;.
  This makes sure that the components of the identifier are
  rendered in left-to-right order. This may also be done by
  using the equivalent features of a higher-order protocol
  (e.g. by using the dir='ltr' attribute in HTML).

4.4 Conversions

Conversion from textual context to visual representation is done
simply by applying the Unicode bidirectional algorithm, i.e. by
passing the whole text to an appropriate rendering engine.

Conversion from processing representation to textual context is
done by adding the necessary formatting characters as described
in Section 4.3.

Conversion from textual context to processing representation is
done by removing the formatting characters at the positions
described in Section 4.3. For international domain names, this
can e.g. be integrated in [Nameprep].

Conversion from visual representation to processing representation
is done by inputting the identifier, component by component from
left to right, using the natural reading order for each component.

From these four conversions, the remaining conversions can be
easily constructed. Any other procedure that leads to the same results
is also allowed.

5. Restrictions

The definitions and conversions in Section 4 only work under the
following restrictions.

1) A component MUST NOT use both right-to-left and left-to-right
   characters.

2) A component MUST NOT contain bidirectional formatting characters
   other than those defined in Section 4.3, at the positions
   defined there.

3) A component using right-to-left characters MUST NOT use any other
   class of characters (e.g. neutrals or numbers).

Restrictions 1) and 2) are not very severe, in that they do not overly
restrict useful identifiers. Also, removing them would make it
impossible for humans to predict the logical sequence of characters
inside a single component. On the other hand, it would be very desirable
to remove or at least soften restriction 3). Otherwise, it is impossible
to combine Arabic or Hebrew letters with numbers, or to use a hyphen
between two subcomponents of an Arabic component to avoid the cursive
connection of the two subcomponents. To a certain extent, softening this
restriction should be easily possible by adding additional formatting
characters in well-defined ways similar to the provisions in Section 4.3.
Feedback on this issue is particularly welcome.

6. Security Considerations

Knowledge of deficiencies of a particular implementation of the above
specification can allow somebody to pretend to resolve a particular
identifier when in fact another identifier is being resolved.

Acknowledgements

The basic ideas for the approach proposed in this memo are due to
Francois Yergeau, and go back to around 1995. Discussions with
Stephen Atkin, Paul Hoffman, and many others provided additional
motivation and insight.

Copyright

Copyright (C) The Internet Society, 1997. All Rights Reserved.

This document and translations of it may be copied and furnished to
others, and derivative works that comment on or otherwise explain it
or assist in its implementation may be prepared, copied, published
and distributed, in whole or in part, without restriction of any
kind, provided that the above copyright notice and this paragraph
are included on all such copies and derivative works.
However, this
document itself may not be modified in any way, such as by removing
the copyright notice or references to the Internet Society or other
Internet organizations, except as needed for the purpose of
developing Internet standards in which case the procedures for
copyrights defined in the Internet Standards process must be
followed, or as required to translate it into languages other
than English.

The limited permissions granted above are perpetual and will not be
revoked by the Internet Society or its successors or assigns.

This document and the information contained herein is provided on an
"AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE."

Author's address

Martin J. Duerst
W3C/Keio University
5322 Endo, Fujisawa
252-8520 Japan
duerst@w3.org
http://www.w3.org/People/D%C3%BCrst/
Tel/Fax: +81 466 49 1170

Note: Please write "Duerst" with u-umlaut wherever
possible, e.g. as "Dürst" in XML and HTML.

References

[HTML4] "HTML 4.01", World Wide Web Consortium,
.

[IDN] Internationalized Domain Name (idn) IETF Working Group. For
further information, please see
.

[IRI] L. Masinter, M. Duerst, "Internationalized Resource Identifiers
(IRI)", Internet Draft, Jan. 2001,
,
work in progress.

[ISO10646] ISO/IEC, Information Technology - Universal Multiple-Octet
Coded Character Set (UCS) - Part 1: Architecture and Basic
Multilingual Plane, Oct. 2000, with amendments.

[Nameprep] P. Hoffman, M. Blanchet, "Preparation of Internationalized
Host Names", Internet Draft, Feb. 2001,
,
work in progress.

[RFC 2119] S. Bradner, "Key words for use in RFCs to Indicate
Requirement Levels", March 1997.

[Unicode] The Unicode Consortium, "The Unicode Standard, Version 3.1",
consisting of: "The Unicode Standard, Version 3.0", Addison-Wesley,
Reading, MA, 2000, and "Unicode Standard Annex #27: Unicode 3.1",
, May 2001.

[UnicodeBidi] The Unicode Consortium, "The Unicode Standard, Version
3.0", Addison-Wesley, Reading, MA, 2000, Section 3.12, pp. 55-69, also
available at
and "Unicode Standard Annex #9: The Bidirectional Algorithm",
, March 2001.

[W3C IRI] Internationalization - URIs and other identifiers
.

[XML] "XML 1.0", World Wide Web Consortium Recommendation,
.
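
Illustrative sketch (non-normative)

The conversions between processing representation and textual context
(Sections 4.3 and 4.4), together with the restrictions of Section 5,
might be sketched as follows. This is only a minimal sketch, not part
of the proposal itself. The function names (split_identifier,
check_component, to_textual, to_processing) and the small set of
structural characters are assumptions made for the illustration.
Right-to-left characters are taken to be those of Unicode bidirectional
classes R and AL, and restriction 3) is read strictly. Removing every
&lrm;, &lre;, and &pdf; when converting back is a simplification that
relies on restriction 2).

```python
import unicodedata

# Formatting characters from Section 2 that Section 4.3 makes use of.
LRM = "\u200e"  # LEFT-TO-RIGHT MARK
LRE = "\u202a"  # LEFT-TO-RIGHT EMBEDDING
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING
# All seven bidirectional formatting characters listed in Section 2.
BIDI_FORMATTING = set("\u200e\u200f\u202a\u202b\u202c\u202d\u202e")

# Assumption: an illustrative subset of structural characters (Section 3).
STRUCTURAL = set("@.:/")

# Assumption: Unicode bidi classes treated as "right-to-left characters".
RTL_CLASSES = {"R", "AL"}


def split_identifier(identifier):
    """Split an identifier into components and structural characters,
    keeping the logical order of Section 4.1."""
    parts, current = [], ""
    for ch in identifier:
        if ch in STRUCTURAL:
            if current:
                parts.append(current)
                current = ""
            parts.append(ch)
        else:
            current += ch
    if current:
        parts.append(current)
    return parts


def is_rtl(component):
    """True if the component contains a right-to-left character."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES
               for ch in component)


def check_component(component):
    """Raise ValueError if a component violates Section 5."""
    # Restriction 2): no formatting characters inside a component.
    if any(ch in BIDI_FORMATTING for ch in component):
        raise ValueError("formatting character inside component")
    classes = {unicodedata.bidirectional(ch) for ch in component}
    # Restrictions 1) and 3), read strictly: a right-to-left component
    # contains characters of no other bidi class.
    if classes & RTL_CLASSES and classes - RTL_CLASSES:
        raise ValueError("right-to-left component mixes classes")


def to_textual(identifier, rtl_context=False):
    """Processing representation -> textual context (Section 4.3)."""
    out = []
    for part in split_identifier(identifier):
        if part in STRUCTURAL:
            out.append(part)
            continue
        check_component(part)
        # Surround each right-to-left component with LRM.
        out.append(LRM + part + LRM if is_rtl(part) else part)
    text = "".join(out)
    # In a right-to-left base context, wrap the identifier in LRE/PDF.
    return LRE + text + PDF if rtl_context else text


def to_processing(text):
    """Textual context -> processing representation (Section 4.4)."""
    return "".join(ch for ch in text if ch not in (LRM, LRE, PDF))
```

With a two-component Hebrew name followed by "com", to_textual places
an &lrm; on each side of both Hebrew components, and to_processing
applied to the result returns the original logical-order identifier.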