idnits 2.17.1 draft-ietf-iri-bidi-guidelines-02.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year == The document seems to lack the recommended RFC 2119 boilerplate, even if it appears to use RFC 2119 keywords. (The document does seem to have the reference to RFC 2119 which the ID-Checklist requires). == The document seems to contain a disclaimer for pre-RFC5378 work, but was first submitted on or after 10 November 2008. The disclaimer is usually necessary only for documents that revise or obsolete older RFCs, and that take significant amounts of text from those RFCs. If you can contact all authors of the source material and they are willing to grant the BCP78 rights to the IETF Trust, you can and should remove the disclaimer. Otherwise, the disclaimer is needed and you can ignore this comment. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (March 9, 2012) is 4431 days in the past. Is this intentional? 
Checking references for intended status: Best Current Practice ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Unused Reference: 'ASCII' is defined on line 358, but no explicit reference was found in the text == Unused Reference: 'ISO10646' is defined on line 362, but no explicit reference was found in the text == Unused Reference: 'RFC3491' is defined on line 375, but no explicit reference was found in the text -- Possible downref: Non-RFC (?) normative reference: ref. 'ASCII' -- Possible downref: Non-RFC (?) normative reference: ref. 'ISO10646' ** Obsolete normative reference: RFC 3490 (Obsoleted by RFC 5890, RFC 5891) ** Obsolete normative reference: RFC 3491 (Obsoleted by RFC 5891) -- Possible downref: Non-RFC (?) normative reference: ref. 'UNI9' -- Possible downref: Non-RFC (?) normative reference: ref. 'UNIV6' -- Duplicate reference: RFC3987, mentioned in 'RFC3987', was also mentioned in 'RFC3987bis'. Summary: 2 errors (**), 0 flaws (~~), 6 warnings (==), 6 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Internationalized Resource Identifiers M. Duerst 3 (iri) Aoyama Gakuin University 4 Internet-Draft L. Masinter 5 Intended status: BCP Adobe 6 Expires: September 10, 2012 A. Allawi 7 Diwan Software Limited 8 March 9, 2012 10 Guidelines for Internationalized Resource Identifiers with Bi- 11 directional Characters (Bidi IRIs) 12 draft-ietf-iri-bidi-guidelines-02 14 Abstract 16 This specification gives guidelines for selection, use, and 17 presentation of Internationalized Resource Identifiers (IRIs) which 18 include characters with inherent right-to-left (rtl) writing 19 direction. 
21 Status of this Memo 23 This Internet-Draft is submitted in full conformance with the 24 provisions of BCP 78 and BCP 79. 26 Internet-Drafts are working documents of the Internet Engineering 27 Task Force (IETF). Note that other groups may also distribute 28 working documents as Internet-Drafts. The list of current Internet- 29 Drafts is at http://datatracker.ietf.org/drafts/current/. 31 Internet-Drafts are draft documents valid for a maximum of six months 32 and may be updated, replaced, or obsoleted by other documents at any 33 time. It is inappropriate to use Internet-Drafts as reference 34 material or to cite them other than as "work in progress." 36 This Internet-Draft will expire on September 10, 2012. 38 Copyright Notice 40 Copyright (c) 2012 IETF Trust and the persons identified as the 41 document authors. All rights reserved. 43 This document is subject to BCP 78 and the IETF Trust's Legal 44 Provisions Relating to IETF Documents 45 (http://trustee.ietf.org/license-info) in effect on the date of 46 publication of this document. Please review these documents 47 carefully, as they describe your rights and restrictions with respect 48 to this document. Code Components extracted from this document must 49 include Simplified BSD License text as described in Section 4.e of 50 the Trust Legal Provisions and are provided without warranty as 51 described in the Simplified BSD License. 53 This document may contain material from IETF Documents or IETF 54 Contributions published or made publicly available before November 55 10, 2008. The person(s) controlling the copyright in some of this 56 material may not have granted the IETF Trust the right to allow 57 modifications of such material outside the IETF Standards Process. 
58 Without obtaining an adequate license from the person(s) controlling 59 the copyright in such materials, this document may not be modified 60 outside the IETF Standards Process, and derivative works of it may 61 not be created outside the IETF Standards Process, except to format 62 it for publication as an RFC or to translate it into languages other 63 than English. 65 Table of Contents 67 1. Introduction 68 1.1. Notation 69 2. Logical Storage and Visual Presentation 70 3. Bidi IRI Structure 71 4. Input of Bidi IRIs 72 5. Examples 73 6. IANA Considerations 74 7. Security Considerations 75 8. Acknowledgements 76 9. References 77 9.1. Normative References 78 9.2. Informative References 79 Authors' Addresses 81 1. Introduction 83 Some UCS characters, such as those used in the Arabic and Hebrew 84 scripts, have an inherent right-to-left (rtl) writing direction as 85 opposed to characters, such as those in Latin scripts, that have an 86 inherent left-to-right (ltr) direction. IRIs containing rtl 87 characters (called bidirectional IRIs or Bidi IRIs) require 88 additional attention because of the non-trivial relation between 89 their logical and visual ordering. The logical order represents the 90 order in which the characters are read and stored on computers. 
The 91 visual order represents the order in which the characters are drawn on a 92 computer display or printout in the way a human expects to read them. 94 Generally, alphabetic characters in scripts like Arabic and Hebrew 95 are drawn rtl while numbers are drawn ltr. Symbols, such as slash 96 '/' and period '.', take their visual direction from the surrounding 97 characters. 99 Because of this complex interaction between the logical 100 representation, the visual representation, and the syntax of a Bidi 101 IRI, a balance is needed between various requirements. The main 102 requirements are: 104 1. user-predictable conversion between visual and logical 105 representation; 107 2. the ability to include a wide range of characters in various parts 108 of the IRI; and 110 3. minor or no changes or restrictions for implementations. 112 1.1. Notation 114 In this document, "Bidi Notation" is used for the given Bidi IRI 115 examples as follows: Lower case letters a-z stand for characters that 116 are written with a left-to-right ordering (such as Latin characters), 117 whereas upper case letters A-Z represent characters that are written 118 right to left (such as Arabic or Hebrew characters). Numbers and 119 symbols stand for themselves. 121 In this document, the key words "MUST", "MUST NOT", "REQUIRED", 122 "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", 123 and "OPTIONAL" are to be interpreted as described in [RFC2119]. 125 2. Logical Storage and Visual Presentation 127 When stored or transmitted in digital representation, Bidi IRIs MUST 128 be in full logical order and MUST conform to the IRI syntax rules 129 (which include the rules relevant to their scheme). This ensures 130 that Bidi IRIs can be processed in the same way as other IRIs. 132 Bidi IRIs MUST be visually ordered by the Unicode Bidirectional 133 Algorithm [UNIV6], [UNI9]. Bidi IRIs MUST be rendered in the same 134 way as they would be if they were in a left-to-right embedding. 
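As a minimal sketch of the plain-text way to obtain such an embedding (the LRE/PDF option in the list that follows), the Python below wraps an IRI for display only. The name embed_for_display is hypothetical and not taken from any specification. The check reflects the requirement, stated later in this section, that the IRI itself contain no bidirectional formatting characters.

```python
# Illustrative sketch only; embed_for_display is a hypothetical helper name.
LRE = "\u202a"  # U+202A LEFT-TO-RIGHT EMBEDDING
PDF = "\u202c"  # U+202C POP DIRECTIONAL FORMATTING

# LRM, RLM, LRE, RLE, PDF, LRO, RLO: not allowed inside the IRI itself.
BIDI_FORMATTING = frozenset("\u200e\u200f\u202a\u202b\u202c\u202d\u202e")


def embed_for_display(iri: str) -> str:
    """Return the IRI wrapped in LRE ... PDF for presentation in plain text."""
    if any(ch in BIDI_FORMATTING for ch in iri):
        raise ValueError("IRI contains a bidirectional formatting character")
    return LRE + iri + PDF
```

The wrapped string is meant for presentation only; the stored or transmitted IRI stays in logical order and without the wrapper.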
136 In conformance with the Unicode Bidirectional Algorithm, embedding 137 MAY be done in one of two ways: 139 1. precede the IRI with U+202A, LEFT-TO-RIGHT EMBEDDING (LRE), and 140 follow with U+202C, POP DIRECTIONAL FORMATTING (PDF); or 142 2. use a higher-level protocol (e.g., the dir='ltr' attribute in 143 HTML). 145 Preceding and following the Bidi IRI with U+200E, LEFT-TO-RIGHT MARK 146 (LRM), is NOT RECOMMENDED, as there are cases where this may not be 147 sufficient to match full left-to-right embedding. 149 There is no requirement to use embedding if the display is still the 150 same without the embedding. For example, a Bidi IRI in a text with 151 left-to-right base directionality (such as used for English or 152 Cyrillic) that is preceded and followed by whitespace and strong 153 left-to-right characters does not need an embedding. Also, a 154 bidirectional relative IRI reference that only contains strong right- 155 to-left characters and weak characters (such as symbols) and that 156 starts and ends with a strong right-to-left character and appears in 157 a text with right-to-left base directionality (such as used for 158 Arabic or Hebrew) and is preceded and followed by whitespace and 159 strong characters does not need an embedding. 161 However, implementers are RECOMMENDED to use embedding in all cases 162 where they are not completely sure that the display behavior is 163 unaffected without the embedding. 165 The Unicode Bidirectional Algorithm ([UNI9], section 4.3) permits 166 higher-level protocols to influence bidirectional rendering. Such 167 changes by higher-level protocols MUST NOT be used if they change the 168 rendering of IRIs. 170 The bidirectional formatting characters that may be used before or 171 after the IRI to ensure correct display are not themselves part of 172 the IRI. IRIs MUST NOT contain bidirectional formatting characters 173 (LRM, RLM, LRE, RLE, LRO, RLO, and PDF). 
They affect the visual 174 rendering of the IRI but do not appear themselves. It would 175 therefore not be possible to input an IRI with such characters 176 correctly. 178 3. Bidi IRI Structure 180 The Unicode Bidirectional Algorithm is designed mainly for plain 181 text. To make sure that it does not affect the rendering of Bidi 182 IRIs outside of the requirements of this document, some restrictions 183 on Bidi IRIs are necessary. These restrictions are given in terms of 184 delimiters (structural characters, mostly punctuation such as "@", 185 ".", ":", and "/") and components (usually consisting mostly of 186 letters and digits). 188 The following syntax rules from the ABNF of [RFC3987bis] correspond 189 to components for the purpose of Bidi behavior: iuserinfo, ireg-name, 190 isegment, isegment-nz, isegment-nz-nc, iquery, and 191 ifragment. 193 Specifications that define the syntax of any of the above components 194 MAY divide them further and define smaller parts to be components 195 according to this document. As an example, the restrictions of 196 [RFC3490] on bidirectional domain names correspond to treating each 197 label of a domain name as a component for schemes with ireg-name as a 198 domain name. Even where the components are not defined formally, it 199 may be helpful to think about some syntax in terms of components and 200 to apply the relevant restrictions. For example, for the usual name/ 201 value syntax in query parts, it is convenient to treat each name and 202 each value as a component. As another example, the extensions in a 203 resource name can be treated as separate components. 205 For each component, the following restrictions apply: 207 1. A component SHOULD NOT use both right-to-left and left-to-right 208 characters. 210 2. A component using right-to-left characters SHOULD start and end 211 with right-to-left characters. 213 The above restrictions are given as "SHOULD"s, rather than as 214 "MUST"s. 
For IRIs that are never presented visually, they are not 215 relevant. However, for IRIs in general, they are very important to 216 ensure consistent conversion between visual presentation and logical 217 representation, in both directions. 219 Note: In some components, the above restrictions may actually be 220 strictly enforced. For example, [RFC3490] requires that these 221 restrictions apply to the labels of a host name for those schemes 222 where ireg-name is a host name. In some other components (for 223 example, path components) following these restrictions may not be 224 too difficult. For other components, such as parts of the query 225 part, it may be very difficult to enforce the restrictions because 226 the values of query parameters may be arbitrary character 227 sequences. 229 If the above restrictions cannot be satisfied otherwise, the affected 230 component can always be mapped to URI notation using the general 231 percent-encoding of IRI components, as described in [RFC3987bis]. 232 Please note that the whole component has to be mapped (see also 233 Example 9 below). 235 4. Input of Bidi IRIs 237 Bidi input methods MUST generate Bidi IRIs in logical order while 238 rendering them according to Section 2. During input, rendering 239 SHOULD be updated after every new character is input to avoid end- 240 user confusion. 242 5. Examples 244 This section gives examples of Bidi IRIs in Bidi Notation. It shows 245 legal IRIs with the relationship between their logical and visual 246 representation and explains how certain phenomena in this 247 relationship may look strange to somebody not familiar with 248 bidirectional behavior, but familiar to users of Arabic and Hebrew. 249 It also shows what happens if the restrictions given in Section 3 are 250 not followed. The examples below can be seen at [BidiEx], in Arabic, 251 Hebrew, and Bidi Notation variants. 
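The two restrictions of Section 3 can be checked mechanically on a single component written in Bidi Notation. The Python below is a minimal sketch under that assumption: check_component is a hypothetical name, upper case A-Z stands for right-to-left characters and lower case a-z for left-to-right characters as in Section 1.1, and splitting an IRI into components is left out.

```python
# Illustrative sketch only; it works on Bidi Notation, not on real Unicode text.
def check_component(component: str) -> list:
    """Return the Section 3 restrictions that a Bidi Notation component violates."""
    has_rtl = any("A" <= ch <= "Z" for ch in component)
    has_ltr = any("a" <= ch <= "z" for ch in component)
    problems = []
    # Restriction 1: do not mix right-to-left and left-to-right characters.
    if has_rtl and has_ltr:
        problems.append("mixes right-to-left and left-to-right characters")
    # Restriction 2: an rtl component starts and ends with an rtl character.
    if has_rtl and not ("A" <= component[0] <= "Z" and "A" <= component[-1] <= "Z"):
        problems.append("does not start and end with a right-to-left character")
    return problems
```

For the component "CDE123FGH" of Example 7 the list is empty, whereas the path component "GH1" of Example 8 is reported as violating the second restriction.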
253 To read the bidi text in the examples, read the visual representation 254 from left to right until you encounter a block of rtl text. Read the 255 rtl block (including slashes and other special characters) from right 256 to left, then continue at the next unread ltr character. 258 Example 1: A single component with rtl characters is inverted: 259 Logical representation: "http://ab.CDEFGH.ij/kl/mn/op.html" 260 Visual representation: "http://ab.HGFEDC.ij/kl/mn/op.html" 261 Components can be read one by one, and each component can be read in 262 its natural direction. 264 Example 2: More than one consecutive component with rtl characters is 265 inverted as a whole: 266 Logical representation: "http://ab.CDE.FGH/ij/kl/mn/op.html" 267 Visual representation: "http://ab.HGF.EDC/ij/kl/mn/op.html" 268 A sequence of rtl components is read rtl, in the same way as a 269 sequence of rtl words is read rtl in a bidi text. 271 Example 3: All components of an IRI (except for the scheme) are rtl. 272 All rtl components are inverted overall: 273 Logical representation: "http://AB.CD.EF/GH/IJ/KL?MN=OP;QR=ST#UV" 274 Visual representation: "http://VU#TS=RQ;PO=NM?LK/JI/HG/FE.DC.BA" 275 The whole IRI (except the scheme) is read rtl. Delimiters between 276 rtl components stay between the respective components; delimiters 277 between ltr and rtl components don't move. 279 Example 4: Each of several sequences of rtl components is inverted on 280 its own: 281 Logical representation: "http://AB.CD.ef/gh/IJ/KL.html" 282 Visual representation: "http://DC.BA.ef/gh/LK/JI.html" 283 Each sequence of rtl components is read rtl, in the same way as each 284 sequence of rtl words in an ltr text is read rtl. 
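The reading rule and Examples 1 to 4 above can be reproduced with a small model in Python. This is a sketch under stated assumptions, not an implementation of the Unicode Bidirectional Algorithm: the input is in Bidi Notation, the IRI is displayed in a left-to-right embedding, and it contains no digits (digits need the weak-type rules of [UNI9], which are omitted here). The name visual_order is hypothetical.

```python
# Illustrative sketch only: a simplified reordering for digit-free Bidi Notation.
def visual_order(logical: str) -> str:
    """Map a digit-free Bidi Notation string from logical to visual order."""
    if any(ch.isdigit() for ch in logical):
        raise ValueError("digits need the full Unicode Bidirectional Algorithm")
    # "R" for upper case (rtl), "L" for lower case (ltr), "N" for symbols.
    kinds = ["R" if ch.isupper() else "L" if ch.islower() else "N" for ch in logical]
    n = len(kinds)
    # A run of symbols becomes rtl only when both neighbours are rtl;
    # otherwise it follows the left-to-right embedding direction.
    i = 0
    while i < n:
        if kinds[i] != "N":
            i += 1
            continue
        j = i
        while j < n and kinds[j] == "N":
            j += 1
        before = kinds[i - 1] if i > 0 else "L"
        after = kinds[j] if j < n else "L"
        fill = "R" if before == after == "R" else "L"
        kinds[i:j] = [fill] * (j - i)
        i = j
    # Reverse each maximal rtl run; ltr runs stay as they are.
    out = []
    i = 0
    while i < n:
        j = i
        while j < n and kinds[j] == kinds[i]:
            j += 1
        run = logical[i:j]
        out.append(run[::-1] if kinds[i] == "R" else run)
        i = j
    return "".join(out)
```

Applied to the logical representation of Example 4, "http://AB.CD.ef/gh/IJ/KL.html", the function returns "http://DC.BA.ef/gh/LK/JI.html", which matches the visual representation given there.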
286 Example 5: Example 2, applied to components of different kinds: 287 Logical representation: "http://ab.cd.EF/GH/ij/kl.html" 288 Visual representation: "http://ab.cd.HG/FE/ij/kl.html" 289 The inversion of the domain name label and the path component may be 290 unexpected, but it is consistent with other bidi behavior. For 291 reassurance that the domain component really is "ab.cd.EF", it may be 292 helpful to read aloud the visual representation following the Unicode 293 Bidirectional Algorithm. After "http://ab.cd." one reads the RTL 294 block "E-F-slash-G-H", which corresponds to the logical 295 representation. 297 Example 6: Same as Example 5, with more rtl components: 298 Logical representation: "http://ab.CD.EF/GH/IJ/kl.html" 299 Visual representation: "http://ab.JI/HG/FE.DC/kl.html" 300 The inversion of the domain name labels and the path components may 301 be easier to identify because the delimiters also move. 303 Example 7: A single rtl component includes digits: 304 Logical representation: "http://ab.CDE123FGH.ij/kl/mn/op.html" 305 Visual representation: "http://ab.HGF123EDC.ij/kl/mn/op.html" 306 Numbers are written ltr in all cases but are treated as an additional 307 embedding inside a run of rtl characters. This is completely 308 consistent with usual bidirectional text. 310 Example 8 (not allowed): Numbers are at the start or end of an rtl 311 component: 312 Logical representation: "http://ab.cd.ef/GH1/2IJ/KL.html" 313 Visual representation: "http://ab.cd.ef/LK/JI1/2HG.html" 314 The sequence "1/2" is interpreted by the Bidirectional Algorithm as a 315 fraction, fragmenting the components and leading to confusion. There 316 are other characters that are interpreted in a special way close to 317 numbers; in particular, "+", "-", "#", "$", "%", ",", ".", and ":". 
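When a component cannot meet the restrictions, as with the path components of Example 8, Section 3 allows the whole component to be mapped to URI notation. The Python below is a minimal sketch of such a mapping. It uses real Hebrew letters instead of Bidi Notation, because Bidi Notation letters are ASCII and would not be percent-encoded. The name map_component_to_uri is hypothetical, and the choice to percent-encode the UTF-8 form of every non-ASCII character while leaving ASCII untouched is an assumption about the mapping described in [RFC3987bis].

```python
# Illustrative sketch only; map_component_to_uri is a hypothetical helper name.
from urllib.parse import quote


def map_component_to_uri(component: str) -> str:
    """Percent-encode every non-ASCII character of one IRI component (UTF-8)."""
    return "".join(ch if ch.isascii() else quote(ch) for ch in component)


# U+05D0 HEBREW LETTER ALEF, U+05D1 HEBREW LETTER BET, then the digit 1.
print(map_component_to_uri("\u05d0\u05d11"))  # prints %D7%90%D7%911
```

After the mapping the component contains no right-to-left characters, so its display no longer depends on bidirectional reordering.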
319 Example 9 (not allowed): The numbers in the previous example are 320 percent-encoded: 321 Logical representation: "http://ab.cd.ef/GH%31/%32IJ/KL.html" 322 Visual representation: "http://ab.cd.ef/LK/JI%32/%31HG.html" 324 Example 10 (allowed but not recommended): 325 Logical representation: "http://ab.CDEFGH.123/kl/mn/op.html" 326 Visual representation: "http://ab.123.HGFEDC/kl/mn/op.html" 327 Components consisting of only numbers are allowed (it would be rather 328 difficult to prohibit them), but these may interact with adjacent RTL 329 components in ways that are not easy to predict. 331 Example 11 (allowed but not recommended): 332 Logical representation: "http://ab.CDEFGH.123ij/kl/mn/op.html" 333 Visual representation: "http://ab.123.HGFEDCij/kl/mn/op.html" 334 Components consisting of numbers and left-to-right characters are 335 allowed, but these may interact with adjacent RTL components in ways 336 that are not easy to predict. 338 6. IANA Considerations 340 This document makes no changes to IANA registries. 342 7. Security Considerations 344 Confusion can occur with bidirectional IRIs if the restrictions in 345 Section 3 are not followed. The same visual representation may be 346 interpreted as different logical representations, and vice versa. It 347 is also very important that a correct Unicode bidirectional 348 implementation be used. 350 8. Acknowledgements 352 This document was derived from [RFC3987] and [RFC3987bis] and the 353 acknowledgements of those documents apply. 355 9. References 356 9.1. Normative References 358 [ASCII] American National Standards Institute, "Coded Character 359 Set -- 7-bit American Standard Code for Information 360 Interchange", ANSI X3.4, 1986. 362 [ISO10646] 363 International Organization for Standardization, "ISO/IEC 364 10646:2003: Information Technology - Universal Multiple- 365 Octet Coded Character Set (UCS)", ISO Standard 10646, 366 December 2003. 
368 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 369 Requirement Levels", BCP 14, RFC 2119, March 1997. 371 [RFC3490] Faltstrom, P., Hoffman, P., and A. Costello, 372 "Internationalizing Domain Names in Applications (IDNA)", 373 RFC 3490, March 2003. 375 [RFC3491] Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep 376 Profile for Internationalized Domain Names (IDN)", 377 RFC 3491, March 2003. 379 [RFC3987bis] 380 Duerst, M., Masinter, L., and M. Suignard, 381 "Internationalized Resource Identifiers (IRIs)", 382 August 2011, 383 . 385 [UNI9] Davis, M., "The Unicode Bidirectional Algorithm", Unicode 386 Standard Annex #9, March 2004, 387 . 389 [UNIV6] The Unicode Consortium, "The Unicode Standard, Version 390 6.0.0 (Mountain View, CA, The Unicode Consortium, 2011, 391 ISBN 978-1-936213-01-6)", October 2010. 393 9.2. Informative References 395 [BidiEx] "Examples of Bidi IRIs", 396 . 398 [RFC3987] Duerst, M. and M. Suignard, "Internationalized Resource 399 Identifiers (IRIs)", RFC 3987, January 2005. 401 Authors' Addresses 403 Martin Duerst 404 Aoyama Gakuin University 405 5-10-1 Fuchinobe 406 Sagamihara, Kanagawa 229-8558 407 Japan 409 Phone: +81 42 759 6329 410 Fax: +81 42 759 6495 411 Email: duerst@it.aoyama.ac.jp 412 URI: http://www.sw.it.aoyama.ac.jp/D%C3%BCrst/ 414 Larry Masinter 415 Adobe 416 345 Park Ave 417 San Jose, CA 95110 418 U.S.A. 420 Phone: +1-408-536-3024 421 Email: masinter@adobe.com 422 URI: http://larry.masinter.net 424 Adil Allawi 425 Diwan Software Limited 426 37-39 Peckham Road 427 London SE5 8UH 428 United Kingdom 430 Phone: +44 7718 785850 431 Fax: +44 20 72525444 432 Email: adil@diwan.com 433 URI: http://ironymark.diwan.com/