INTERNET-DRAFT
Martin Duerst
W3C/Keio University
draft-duerst-iri-bidi-00.txt
Expires January 2002
July 13, 2001

Internet Identifiers and Bidirectionality

Status of this Memo

This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt.

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

This document is not a product of any working group, but should be discussed on the mailing list . Comments of editorial nature should be sent directly to the author. For more information on the topic of this Internet-Draft, please also see [W3C IRI].

Abstract

This memo describes how to deal with Internet identifiers containing characters from scripts such as Arabic and Hebrew, which use right-to-left or bidirectional writing. The solution proposed addresses three different contexts: the purely graphical representation of such identifiers, e.g. on paper; the embedding of such identifiers into running text with established rules for bidirectionality; and the processing and resolution of such identifiers.

0. Change history

Version 00: This memo has been separated out from [IRI], Section 3.2 to allow more in-depth and focused discussion of the specific problems of bidirectionality.

1. Introduction

There is an increasing tendency to allow identifiers to use a wide range of characters from the scripts of the world.
The Universal Character Set (UCS, see [Unicode] and [ISO10646]) makes it easy to use and exchange such identifiers digitally. With the appropriate care (similar to the care needed to avoid confusion between '1', 'l', and 'I' in US-ASCII-based identifiers), such identifiers can also be exchanged non-digitally, e.g. written down visually on a medium such as paper. Potential examples of such identifiers include Internationalized Resource Identifiers [IRI], Internationalized Domain Names [IDN], and internationalized email addresses.

Some characters, in particular those of the Arabic and the Hebrew script, are written from right to left. Together with characters written from left to right, or with digits that are written from left to right even in these scripts, this gives rise to the mixture of different writing directions, a phenomenon called bidirectionality.

Dealing with bidirectionality is indispensable for the proper treatment of text written with the Arabic or Hebrew script. But it is highly complex because user expectations may depend on context and are often difficult to identify and express. This memo deals with the specific problems of Internet identifiers containing right-to-left characters, hereafter called bidirectional identifiers.

The basic paradigm of all modern bidirectional text handling solutions is the distinction between digital backing store, where text is stored in logical order, and rendering (display or printing), for which the necessary reordering is applied according to well-defined rules. 'Logical order' in this context is the order in which the characters in the text are pronounced or spelled out. Using logical order in the digital backing store simplifies a large number of operations, including sorting, searching, text-to-speech conversion, various other kinds of linguistic processing, input from keyboards and other devices, and rendering-related operations such as line breaking and text reflow.
The alternative is to use display order even in the backing store, but this makes some of the operations above much more complex and others impossible.

For general text (e.g. average prose), the Unicode bidirectional algorithm [UnicodeBidi] is the single widely accepted and used reference for providing this reordering from logical order to rendering. The Unicode bidirectional algorithm consists of an implicit part (producing adequate results in most cases) and explicit formatting characters for advanced cases. The Unicode bidirectional algorithm also allows higher-order protocols to override certain aspects of the algorithm. A case where this has been done is the 'dir' attribute in [HTML4].

Bidirectional Internet identifiers are primarily used in three different contexts:

1) In visual form: This includes display on CRTs and LCDs as well as more permanent visual forms such as printing. At least as far as the reading of individual components is concerned, the visual form has to use the inherent directionality of the characters used. Otherwise, identification, reading, transcription, and so on are severely affected.

2) In digital form inside running text (e.g. an IRI or an email address in an email or on a web page). It is not always easy or possible to distinguish identifiers from other text.

3) In digital form on its own (e.g. in a structured format or database of identifiers, or when transmitted for resolution). It should be possible to process bidirectional identifiers in the same way as other Internet identifiers.

This memo addresses all three of these cases as well as the conversion between them. The specifics of bidirectional text and of identifier structure make it impossible to design a solution that works without additional effort (when compared to non-bidirectional identifiers). However, the solution proposed in this memo is designed to make the best of the severe constraints.

2. Notational Conventions

Keywords in all upper-case such as MUST and SHOULD are defined in [RFC 2119].

For examples, lower-case letters are used for letters that flow from left to right. Upper-case letters stand for letters that flow from right to left. A left-to-right example would be 'hello', whereas a right-to-left example would be 'OLLEH'.

For bidirectional formatting characters from [Unicode], the [XML]-style entity notation is used, as follows:

   &lrm;   U+200E   LEFT-TO-RIGHT MARK
   &rlm;   U+200F   RIGHT-TO-LEFT MARK
   &lre;   U+202A   LEFT-TO-RIGHT EMBEDDING
   &rle;   U+202B   RIGHT-TO-LEFT EMBEDDING
   &pdf;   U+202C   POP DIRECTIONAL FORMATTING
   &lro;   U+202D   LEFT-TO-RIGHT OVERRIDE
   &rlo;   U+202E   RIGHT-TO-LEFT OVERRIDE

Only the first two are defined in [HTML4]; the others are replaced by the 'dir' attribute and the element.

3. Identifier Structure

Most Internet identifiers have an inherent structure that distinguishes structural characters (usually punctuation such as '@', '.', ':', '/', and so on) and payload components (usually formed with plain alphabetic or alphanumeric characters). In order to be able to process bidirectional identifiers in the same way as other identifiers, it is crucial that in the digital representations, the individual structural characters and identifier components are stored in the same sequence as for other identifiers.

The main problem to solve for the visual representation of bidirectional identifiers is whether the general sequence of components and syntax characters should be from left to right or from right to left, i.e. whether the right-to-left equivalent of "ftp.example.com" should be "MOC.ELPMAXE.PTF" or "PTF.ELPMAXE.MOC". The former may be seen as more natural in a purely right-to-left context. But there is also the possibility of mixed identifiers such as "PTF.ELPMAXE.com". These provide a very strong motivation for maintaining the same left-to-right overall component sequence for all Internet identifiers.
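The left-to-right component sequence discussed above can be illustrated with a small sketch. It is not part of the proposal itself: it uses this memo's example notation (upper-case stands for right-to-left letters), the set of structural characters is an assumption taken from the examples named in this section, and the function names are hypothetical.

```python
import re

# Structural characters mentioned in the text; the real set depends on the
# syntax of the identifier in question (assumption for illustration only).
STRUCTURAL = "@.:/"


def split_identifier(identifier):
    """Split a logical-order identifier into components and structural
    characters, preserving their sequence."""
    pattern = "([" + re.escape(STRUCTURAL) + "])"
    return [token for token in re.split(pattern, identifier) if token]


def is_rtl(component):
    """Example notation of this memo: upper-case letters stand for
    right-to-left letters. A real implementation would have to consult
    the Unicode bidirectional character properties instead."""
    return component.isupper()


def visual_form(identifier):
    """Keep the overall component sequence left to right, but show each
    right-to-left component in its own natural direction."""
    tokens = split_identifier(identifier)
    return "".join(tok[::-1] if is_rtl(tok) else tok for tok in tokens)


print(visual_form("FTP.EXAMPLE.com"))  # PTF.ELPMAXE.com
print(visual_form("FTP.EXAMPLE.COM"))  # PTF.ELPMAXE.MOC
```

With this convention, a purely left-to-right identifier such as "ftp.example.com" is unchanged, and the mixed identifier keeps "com" in the same position as in any other domain name.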
The Unicode bidirectional algorithm, extremely simplified, tries to reorder contiguous sequences of right-to-left characters between contiguous sequences of left-to-right characters. A third category, called neutrals, is processed in the same way as surrounding characters. The main problem for identifiers is that all the structural characters are treated as neutrals by the Unicode algorithm, which means that they are moved around together with their context. As an example, the logical sequence FTP.EXAMPLE.com (corresponding to the example above) is, without additional care, displayed as ELPMAXE.PTF.com, which is obviously highly confusing.

4. Bidirectional Identifiers in Context

4.1 Independently Processed Bidirectional Identifiers

Bidirectional identifiers processed independently, i.e. stored or transmitted for resolution, MUST be in full logical order both for the overall structure and for the individual components. They MUST conform directly to the relevant syntax rules.

4.2 Visual Rendering of Bidirectional Identifiers

Bidirectional identifiers MUST be rendered visually by placing the components and structural characters in sequence from left to right. Each component MUST be rendered according to its natural direction (i.e. left-to-right for components with left-to-right characters, right-to-left for components with right-to-left characters).

4.3 Bidirectional Identifiers in Textual Context

In textual context, i.e. assuming rendering by the Unicode bidirectional algorithm, the backing store representation prescribed in Section 4.1 and the visual rendering prescribed in Section 4.2 have to be combined. This is done as follows:

- Each component with right-to-left characters is preceded and followed by an &lrm;. This left-to-right mark provides a left-to-right context to intervening syntactic characters.

- If the overall context (base directionality) is right-to-left, the identifier is preceded by an &lre; and followed by a &pdf;.
This makes sure that the components of the identifier are rendered in left-to-right order. This may also be done by using the equivalent features of a higher-order protocol (e.g. by using the dir='ltr' attribute in HTML).

4.4 Conversions

Conversion from textual context to visual representation is done simply by applying the Unicode bidirectional algorithm, i.e. by passing the whole text to an appropriate rendering engine.

Conversion from processing representation to textual context is done by adding the necessary formatting characters as described in Section 4.3.

Conversion from textual context to processing representation is done by removing the formatting characters at the positions described in Section 4.3. For internationalized domain names, this can e.g. be integrated in [Nameprep].

Conversion from visual representation to processing representation is done by inputting the identifier component by component from left to right, using the natural reading order for each component.

From these conversions, the remaining conversions can be easily constructed. Any other procedure that leads to the same results is also allowed.

5. Restrictions

The definitions and conversions in Section 4 only work under the following restrictions.

1) A component MUST NOT use both right-to-left and left-to-right characters.

2) A component MUST NOT contain bidirectional formatting characters, except for those defined in Section 4.3, in the positions defined there.

3) A component using right-to-left characters MUST NOT use any other class of characters (e.g. neutrals or numbers).

Restrictions 1) and 2) are not very severe, in that they do not overly restrict useful identifiers. Also, trying to remove them would make it impossible for humans to predict the logical sequence of characters inside a single component.

On the other hand, it would be very desirable to remove or at least soften restriction 3).
Otherwise, it is impossible to combine Arabic or Hebrew letters with numbers, or to use a hyphen between two subcomponents of an Arabic component to avoid the cursive connection of the two subcomponents. To a certain extent, softening this restriction should be easily possible by adding additional formatting characters in well-defined ways similar to the provisions in Section 4.3. Feedback on this issue is particularly welcome.

6. Security Considerations

Knowledge of deficiencies in a particular implementation of the above specification can allow somebody to pretend to resolve a particular identifier when in fact another identifier is being resolved.

Acknowledgements

The basic idea for the approach proposed in this memo is due to Francois Yergeau, and goes back to around 1995. Discussions with Stephen Atkin, Paul Hoffman, and many others provided additional motivation and insight.

Copyright

Copyright (C) The Internet Society, 1997. All Rights Reserved.

This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works. However, this document itself may not be modified in any way, such as by removing the copyright notice or references to the Internet Society or other Internet organizations, except as needed for the purpose of developing Internet standards in which case the procedures for copyrights defined in the Internet Standards process must be followed, or as required to translate it into languages other than English.

The limited permissions granted above are perpetual and will not be revoked by the Internet Society or its successors or assigns.
This document and the information contained herein is provided on an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Author's address

Martin J. Duerst
W3C/Keio University
5322 Endo, Fujisawa
252-8520 Japan
duerst@w3.org
http://www.w3.org/People/D%C3%BCrst/
Tel/Fax: +81 466 49 1170

Note: Please write "Duerst" with u-umlaut wherever possible, e.g. as "Dürst" in XML and HTML.

References

[HTML4] "HTML 4.01", World Wide Web Consortium, .

[IDN] Internationalized Domain Name (idn) IETF Working Group. For further information, please see .

[IRI] L. Masinter, M. Duerst, "Internationalized Resource Identifiers (IRI)", Internet Draft, Jan. 2001, , work in progress.

[ISO10646] ISO/IEC, Information Technology - Universal Multiple-Octet Coded Character Set (UCS) - Part 1: Architecture and Basic Multilingual Plane, Oct. 2000, with amendments.

[Nameprep] P. Hoffman, M. Blanchet, "Preparation of Internationalized Host Names", Internet Draft, Feb. 2001, , work in progress.

[RFC 2119] S. Bradner, "Key words for use in RFCs to Indicate Requirement Levels", March 1997.

[Unicode] The Unicode Consortium, "The Unicode Standard, Version 3.1", consisting of: "The Unicode Standard, Version 3.0", Addison-Wesley, Reading, MA, 2000, and "Unicode Standard Annex #27: Unicode 3.1", , May 2001.

[UnicodeBidi] The Unicode Consortium, "The Unicode Standard, Version 3.0", Addison-Wesley, Reading, MA, 2000, Section 3.12, pp. 55-69, also available at and "Unicode Standard Annex #9: The Bidirectional Algorithm", , March 2001.

[W3C IRI] Internationalization - URIs and other identifiers .

[XML] "XML 1.0", World Wide Web Consortium Recommendation, .