Internationalized Resource Identifiers                         M. Duerst
(iri)                                           Aoyama Gakuin University
Internet-Draft                                               L. Masinter
Intended status: BCP                                               Adobe
Expires: September 3, 2012                                     A. Allawi
                                                  Diwan Software Limited
                                                           March 2, 2012


       Guidelines for Internationalized Resource Identifiers with
                 Bi-directional Characters (Bidi IRIs)
                   draft-ietf-iri-bidi-guidelines-01

Abstract

   This specification gives guidelines for the selection, use, and
   presentation of Internationalized Resource Identifiers (IRIs) that
   include characters with an inherent right-to-left (rtl) writing
   direction.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on September 3, 2012.

Copyright Notice

   Copyright (c) 2012 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

   This document may contain material from IETF Documents or IETF
   Contributions published or made publicly available before November
   10, 2008.  The person(s) controlling the copyright in some of this
   material may not have granted the IETF Trust the right to allow
   modifications of such material outside the IETF Standards Process.
   Without obtaining an adequate license from the person(s) controlling
   the copyright in such materials, this document may not be modified
   outside the IETF Standards Process, and derivative works of it may
   not be created outside the IETF Standards Process, except to format
   it for publication as an RFC or to translate it into languages other
   than English.

Table of Contents

   1.  Introduction
     1.1.  Notation
   2.  Logical Storage and Visual Presentation
   3.  Bidi IRI Structure
   4.  Input of Bidi IRIs
   5.  Examples
   6.  IANA Considerations
   7.  Security Considerations
   8.  Acknowledgements
   9.  References
     9.1.  Normative References
     9.2.  Informative References
   Authors' Addresses

1.  Introduction

   Some UCS characters, such as those used in the Arabic and Hebrew
   scripts, have an inherent right-to-left (rtl) writing direction.
   IRIs containing these characters (called bidirectional IRIs or Bidi
   IRIs) require additional attention because of the non-trivial
   relation between logical representation (used for digital
   representation and for reading/spelling) and visual representation
   (used for display/printing).

   Because of the complex interaction between the logical
   representation, the visual representation, and the syntax of a Bidi
   IRI, a balance is needed between various requirements.  The main
   requirements are:

   1.  user-predictable conversion between visual and logical
       representation;

   2.  the ability to include a wide range of characters in various
       parts of the IRI; and

   3.  minor or no changes or restrictions for implementations.

1.1.  Notation

   In this document, Bidi Notation is used for bidirectional examples:
   Lower case letters stand for Latin letters or other letters that are
   written left to right, whereas upper case letters represent Arabic or
   Hebrew letters that are written right to left.

   In this document, the key words "MUST", "MUST NOT", "REQUIRED",
   "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY",
   and "OPTIONAL" are to be interpreted as described in [RFC2119].

2.  Logical Storage and Visual Presentation

   When stored or transmitted in digital representation, bidirectional
   IRIs MUST be in full logical order and MUST conform to the IRI syntax
   rules (which include the rules relevant to their scheme).  This
   ensures that bidirectional IRIs can be processed in the same way as
   other IRIs.

   Bidirectional IRIs MUST be rendered by using the Unicode
   Bidirectional Algorithm [UNIV6], [UNI9].  Bidirectional IRIs MUST be
   rendered in the same way as they would be if they were in a
   left-to-right embedding; i.e., as if they were preceded by U+202A,
   LEFT-TO-RIGHT EMBEDDING (LRE), and followed by U+202C, POP
   DIRECTIONAL FORMATTING (PDF).  Setting the embedding direction can
   also be done in a higher-level protocol (e.g., the dir='ltr'
   attribute in HTML).

   There is no requirement to use the above embedding if the display is
   still the same without the embedding.  For example, a bidirectional
   IRI in a text with left-to-right base directionality (such as used
   for English or Cyrillic) that is preceded and followed by whitespace
   and strong left-to-right characters does not need an embedding.
   Also, a bidirectional relative IRI reference that only contains
   strong right-to-left characters and weak characters and that starts
   and ends with a strong right-to-left character and appears in a text
   with right-to-left base directionality (such as used for Arabic or
   Hebrew) and is preceded and followed by whitespace and strong
   characters does not need an embedding.

   In some other cases, using U+200E, LEFT-TO-RIGHT MARK (LRM), may be
   sufficient to force the correct display behavior.  However, the
   details of the Unicode Bidirectional Algorithm are not always easy to
   understand.  Implementers are strongly advised to err on the side of
   caution and to use embedding in all cases where they are not
   completely sure that the display behavior is unaffected without the
   embedding.

   The Unicode Bidirectional Algorithm ([UNI9], section 4.3) permits
   higher-level protocols to influence bidirectional rendering.  Such
   changes by higher-level protocols MUST NOT be used if they change the
   rendering of IRIs.

   The bidirectional formatting characters that may be used before or
   after the IRI to ensure correct display are not themselves part of
   the IRI.  IRIs MUST NOT contain bidirectional formatting characters
   (LRM, RLM, LRE, RLE, LRO, RLO, and PDF).  These characters affect the
   visual rendering of the IRI but are themselves invisible, so a user
   could not correctly input an IRI that contained them.

3.  Bidi IRI Structure

   The Unicode Bidirectional Algorithm is designed mainly for running
   text.  To make sure that it does not affect the rendering of
   bidirectional IRIs too much, some restrictions on bidirectional IRIs
   are necessary.  These restrictions are given in terms of delimiters
   (structural characters, mostly punctuation such as "@", ".", ":", and
   "/") and components (usually consisting mostly of letters and
   digits).

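   As a rough illustration of the Section 2 rules on embedding and on
   bidirectional formatting characters, the following Python sketch
   rejects an IRI that contains any of the seven formatting characters
   and otherwise wraps it in LRE ... PDF for insertion into running
   text.  The function names find_bidi_formatting and embed_for_display
   are hypothetical and do not come from any specification; the wrapper
   characters are presentation only and are not part of the IRI.

```python
from __future__ import annotations

# Bidirectional formatting characters that must not appear inside an IRI.
BIDI_FORMATTING = {
    "\u200E": "LRM",
    "\u200F": "RLM",
    "\u202A": "LRE",
    "\u202B": "RLE",
    "\u202D": "LRO",
    "\u202E": "RLO",
    "\u202C": "PDF",
}

LRE = "\u202A"  # LEFT-TO-RIGHT EMBEDDING
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING


def find_bidi_formatting(iri: str) -> list[tuple[int, str]]:
    """Return (index, name) for every formatting character found in iri."""
    return [(i, BIDI_FORMATTING[ch]) for i, ch in enumerate(iri)
            if ch in BIDI_FORMATTING]


def embed_for_display(iri: str) -> str:
    """Wrap a logical-order IRI in LRE ... PDF (illustrative helper).

    The stored IRI is left untouched; only the returned display string
    carries the embedding characters.
    """
    offending = find_bidi_formatting(iri)
    if offending:
        raise ValueError(f"IRI contains bidi formatting characters: {offending}")
    return LRE + iri + PDF
```

   A caller would keep the logical-order IRI for storage and
   transmission and use the wrapped string only where the IRI is shown
   to a user.  Always wrapping follows the advice above to err on the
   side of caution.
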
   The following syntax rules from the ABNF of [RFC3987bis] correspond
   to components for the purpose of Bidi behavior: iuserinfo, ireg-name,
   isegment, isegment-nz, isegment-nz-nc, iquery, and ifragment.

   Specifications that define the syntax of any of the above components
   MAY divide them further and define smaller parts to be components
   according to this document.  As an example, the restrictions of
   [RFC3490] on bidirectional domain names correspond to treating each
   label of a domain name as a component for schemes with ireg-name as a
   domain name.  Even where the components are not defined formally, it
   may be helpful to think about some syntax in terms of components and
   to apply the relevant restrictions.  For example, for the usual
   name/value syntax in query parts, it is convenient to treat each name
   and each value as a component.  As another example, the extensions in
   a resource name can be treated as separate components.

   For each component, the following restrictions apply:

   1.  A component SHOULD NOT use both right-to-left and left-to-right
       characters.

   2.  A component using right-to-left characters SHOULD start and end
       with right-to-left characters.

   The above restrictions are given as "SHOULD"s, rather than as
   "MUST"s.  For IRIs that are never presented visually, they are not
   relevant.  However, for IRIs in general, they are very important to
   ensure consistent conversion between visual presentation and logical
   representation, in both directions.

      Note: In some components, the above restrictions may actually be
      strictly enforced.  For example, [RFC3490] requires that these
      restrictions apply to the labels of a host name for those schemes
      where ireg-name is a host name.  In some other components (for
      example, path components) following these restrictions may not be
      too difficult.  For other components, such as parts of the query
      part, it may be very difficult to enforce the restrictions because
      the values of query parameters may be arbitrary character
      sequences.

   If the above restrictions cannot be satisfied otherwise, the affected
   component can always be mapped to URI notation using the general
   percent-encoding of IRI components, as described in [RFC3987bis].
   Please note that the whole component has to be mapped (see also
   Example 9 below).

4.  Input of Bidi IRIs

   Bidi input methods MUST generate Bidi IRIs in logical order while
   rendering them according to Section 2.  During input, rendering
   SHOULD be updated after every new character is input to avoid
   end-user confusion.

5.  Examples

   This section gives examples of bidirectional IRIs, in Bidi Notation.
   It shows legal IRIs with the relationship between logical and visual
   representation and explains how certain phenomena in this
   relationship may look strange to somebody not familiar with
   bidirectional behavior, but familiar to users of Arabic and Hebrew.
   It also shows what happens if the restrictions given in Section 3 are
   not followed.  The examples below can be seen at [BidiEx], in Arabic,
   Hebrew, and Bidi Notation variants.

   To read the bidi text in the examples, read the visual representation
   from left to right until you encounter a block of rtl text.  Read the
   rtl block (including slashes and other special characters) from right
   to left, then continue at the next unread ltr character.

   Example 1: A single component with rtl characters is inverted:
   Logical representation: "http://ab.CDEFGH.ij/kl/mn/op.html"
   Visual representation: "http://ab.HGFEDC.ij/kl/mn/op.html"
   Components can be read one by one, and each component can be read in
   its natural direction.

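   The reading procedure described at the start of this section can be
   approximated in code.  The following Python sketch is a deliberately
   simplified model for Bidi Notation strings in a left-to-right
   embedding; it is not an implementation of the Unicode Bidirectional
   Algorithm [UNI9].  It assumes that upper case ASCII letters are rtl,
   that lower case ASCII letters are ltr, that digits form numbers
   (joined by a single ".", ",", ":", "/", "+", or "-" between digits,
   with adjacent "#", "$", or "%" attached), and that everything else is
   neutral.  The names to_visual and _tokenize are illustrative only.

```python
import re

# A "number" in this simplified model: digits, optionally joined by a single
# separator between digits ("1/2"), with adjacent "#", "$" or "%" attached.
_NUMBER = re.compile(r"(?:[#$%]*[0-9]+(?:[.,:/+\-][0-9]+)*)+[#$%]*")


def _tokenize(text):
    """Split text into atomic tokens: whole numbers or single characters."""
    tokens, i = [], 0
    while i < len(text):
        match = _NUMBER.match(text, i)
        if match:
            tokens.append(match.group())
            i = match.end()
        else:
            tokens.append(text[i])
            i += 1
    return tokens


def to_visual(logical):
    """Return the visual order of a Bidi Notation string in an ltr embedding."""
    tokens = _tokenize(logical)

    # Classify tokens: "R" (upper case), "L" (lower case), "EN" (a number
    # that follows rtl text), or "ON" (delimiters and other neutrals).
    kinds, last_strong = [], "L"
    for tok in tokens:
        if len(tok) == 1 and "A" <= tok <= "Z":
            kinds.append("R")
            last_strong = "R"
        elif len(tok) == 1 and "a" <= tok <= "z":
            kinds.append("L")
            last_strong = "L"
        elif any(ch.isdigit() for ch in tok):
            # A number that follows ltr text simply behaves as ltr text.
            kinds.append("EN" if last_strong == "R" else "L")
        else:
            kinds.append("ON")

    def neighbour(index, step):
        """Kind of the nearest non-neutral token, scanning in one direction."""
        while 0 <= index < len(kinds):
            if kinds[index] != "ON":
                return kinds[index]
            index += step
        return "L"  # the edges of an ltr embedding count as ltr

    # A neutral joins an rtl run only if the nearest non-neutral tokens on
    # both sides are rtl letters or numbers; otherwise it stays ltr.
    rtl = [
        kind in ("R", "EN")
        or (kind == "ON"
            and neighbour(i - 1, -1) in ("R", "EN")
            and neighbour(i + 1, +1) in ("R", "EN"))
        for i, kind in enumerate(kinds)
    ]

    # Reverse each maximal rtl run token by token; numbers stay intact.
    out, run = [], []
    for tok, is_rtl in zip(tokens, rtl):
        if is_rtl:
            run.append(tok)
        else:
            out.extend(reversed(run))
            run = []
            out.append(tok)
    out.extend(reversed(run))
    return "".join(out)
```

   Under these assumptions, to_visual("http://ab.CDEFGH.ij/kl/mn/op.html")
   gives the visual representation shown in Example 1.  Real text
   should always be rendered by a conforming bidirectional
   implementation rather than by a model of this kind.
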
   Example 2: More than one consecutive component with rtl characters is
   inverted as a whole:
   Logical representation: "http://ab.CDE.FGH/ij/kl/mn/op.html"
   Visual representation: "http://ab.HGF.EDC/ij/kl/mn/op.html"
   A sequence of rtl components is read rtl, in the same way as a
   sequence of rtl words is read rtl in a bidi text.

   Example 3: All components of an IRI (except for the scheme) are rtl.
   All rtl components are inverted overall:
   Logical representation: "http://AB.CD.EF/GH/IJ/KL?MN=OP;QR=ST#UV"
   Visual representation: "http://VU#TS=RQ;PO=NM?LK/JI/HG/FE.DC.BA"
   The whole IRI (except the scheme) is read rtl.  Delimiters between
   rtl components stay between the respective components; delimiters
   between ltr and rtl components don't move.

   Example 4: Each of several sequences of rtl components is inverted on
   its own:
   Logical representation: "http://AB.CD.ef/gh/IJ/KL.html"
   Visual representation: "http://DC.BA.ef/gh/LK/JI.html"
   Each sequence of rtl components is read rtl, in the same way as each
   sequence of rtl words in an ltr text is read rtl.

   Example 5: Example 2, applied to components of different kinds:
   Logical representation: "http://ab.cd.EF/GH/ij/kl.html"
   Visual representation: "http://ab.cd.HG/FE/ij/kl.html"
   The inversion of the domain name label and the path component may be
   unexpected, but it is consistent with other bidi behavior.  For
   reassurance that the domain component really is "ab.cd.EF", it may be
   helpful to read aloud the visual representation following the bidi
   algorithm.  After "http://ab.cd." one reads the rtl block
   "E-F-slash-G-H", which corresponds to the logical representation.

   Example 6: Same as Example 5, with more rtl components:
   Logical representation: "http://ab.CD.EF/GH/IJ/kl.html"
   Visual representation: "http://ab.JI/HG/FE.DC/kl.html"
   The inversion of the domain name labels and the path components may
   be easier to identify because the delimiters also move.

   Example 7: A single rtl component includes digits:
   Logical representation: "http://ab.CDE123FGH.ij/kl/mn/op.html"
   Visual representation: "http://ab.HGF123EDC.ij/kl/mn/op.html"
   Numbers are written ltr in all cases but are treated as an additional
   embedding inside a run of rtl characters.  This is completely
   consistent with usual bidirectional text.

   Example 8 (not allowed): Numbers are at the start or end of an rtl
   component:
   Logical representation: "http://ab.cd.ef/GH1/2IJ/KL.html"
   Visual representation: "http://ab.cd.ef/LK/JI1/2HG.html"
   The sequence "1/2" is interpreted by the bidi algorithm as a
   fraction, fragmenting the components and leading to confusion.  There
   are other characters that are interpreted in a special way close to
   numbers; in particular, "+", "-", "#", "$", "%", ",", ".", and ":".

   Example 9 (not allowed): The numbers in the previous example are
   percent-encoded:
   Logical representation: "http://ab.cd.ef/GH%31/%32IJ/KL.html"
   Visual representation: "http://ab.cd.ef/LK/JI%32/%31HG.html"

   Example 10 (allowed but not recommended):
   Logical representation: "http://ab.CDEFGH.123/kl/mn/op.html"
   Visual representation: "http://ab.123.HGFEDC/kl/mn/op.html"
   Components consisting of only numbers are allowed (it would be rather
   difficult to prohibit them), but these may interact with adjacent rtl
   components in ways that are not easy to predict.

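   The two per-component restrictions of Section 3, which Example 8
   violates, can be checked mechanically.  The following Python sketch
   works on real characters rather than Bidi Notation and relies on
   unicodedata.bidirectional from the Python standard library.  It
   assumes that only the bidi classes R and AL count as right-to-left
   and only L counts as left-to-right, so a component that ends in a
   combining mark would be flagged even though it may display
   acceptably.  The function name check_component and the message
   wording are illustrative only.

```python
from __future__ import annotations

import unicodedata


def check_component(component: str) -> list[str]:
    """Return the Section 3 restrictions that a single component breaks."""
    classes = [unicodedata.bidirectional(ch) for ch in component]
    is_rtl = [cls in ("R", "AL") for cls in classes]
    has_rtl = any(is_rtl)
    has_ltr = any(cls == "L" for cls in classes)

    problems = []
    # Restriction 1: do not mix rtl and ltr characters in one component.
    if has_rtl and has_ltr:
        problems.append("mixes right-to-left and left-to-right characters")
    # Restriction 2: a component with rtl characters starts and ends rtl.
    if has_rtl and not (is_rtl[0] and is_rtl[-1]):
        problems.append("does not start and end with a right-to-left character")
    return problems
```

   For instance, a component made of two Hebrew letters followed by the
   digit "1", which has the same shape as "GH1" in Example 8, breaks
   the second restriction, whereas a component consisting only of
   digits passes both checks, in line with Example 10.
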
   Example 11 (allowed but not recommended):
   Logical representation: "http://ab.CDEFGH.123ij/kl/mn/op.html"
   Visual representation: "http://ab.123.HGFEDCij/kl/mn/op.html"
   Components consisting of numbers and left-to-right characters are
   allowed, but these may interact with adjacent rtl components in ways
   that are not easy to predict.

6.  IANA Considerations

   This document makes no changes to IANA registries.

7.  Security Considerations

   Confusion can occur with bidirectional IRIs if the restrictions in
   Section 3 are not followed.  The same visual representation may be
   interpreted as different logical representations, and vice versa.  It
   is also very important that a correct Unicode bidirectional
   implementation be used.

8.  Acknowledgements

   This document was derived from [RFC3987] and [RFC3987bis], and the
   acknowledgments of those documents apply.

9.  References

9.1.  Normative References

   [ASCII]    American National Standards Institute, "Coded Character
              Set -- 7-bit American Standard Code for Information
              Interchange", ANSI X3.4, 1986.

   [ISO10646]
              International Organization for Standardization, "ISO/IEC
              10646:2003: Information Technology - Universal
              Multiple-Octet Coded Character Set (UCS)", ISO
              Standard 10646, December 2003.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC3490]  Faltstrom, P., Hoffman, P., and A. Costello,
              "Internationalizing Domain Names in Applications (IDNA)",
              RFC 3490, March 2003.

   [RFC3491]  Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep
              Profile for Internationalized Domain Names (IDN)",
              RFC 3491, March 2003.

   [RFC3987bis]
              Duerst, M., Masinter, L., and M. Suignard,
              "Internationalized Resource Identifiers (IRIs)",
              August 2011.

   [UNI9]     Davis, M., "The Bidirectional Algorithm", Unicode Standard
              Annex #9, March 2004.

   [UNIV6]    The Unicode Consortium, "The Unicode Standard, Version
              6.0.0 (Mountain View, CA, The Unicode Consortium, 2011,
              ISBN 978-1-936213-01-6)", October 2010.

9.2.  Informative References

   [BidiEx]   "Examples of bidirectional IRIs".

   [RFC3987]  Duerst, M. and M. Suignard, "Internationalized Resource
              Identifiers (IRIs)", RFC 3987, January 2005.

Authors' Addresses

   Martin Duerst
   Aoyama Gakuin University
   5-10-1 Fuchinobe
   Sagamihara, Kanagawa  229-8558
   Japan

   Phone: +81 42 759 6329
   Fax:   +81 42 759 6495
   Email: duerst@it.aoyama.ac.jp
   URI:   http://www.sw.it.aoyama.ac.jp/D%C3%BCrst/

   Larry Masinter
   Adobe
   345 Park Ave
   San Jose, CA  95110
   U.S.A.

   Phone: +1-408-536-3024
   Email: masinter@adobe.com
   URI:   http://larry.masinter.net

   Adil Allawi
   Diwan Software Limited
   37-39 Peckham Road
   London  SE5 8UH
   United Kingdom

   Phone: +44 7718 785850
   Fax:   +44 20 72525444
   Email: adil@diwan.com
   URI:   http://ironymark.diwan.com/