idnits 2.17.1

draft-ietf-iri-bidi-guidelines-00.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == The document seems to lack the recommended RFC 2119 boilerplate, even if
     it appears to use RFC 2119 keywords.

     (The document does seem to have the reference to RFC 2119 which the
     ID-Checklist requires).

  == The document seems to contain a disclaimer for pre-RFC5378 work, but was
     first submitted on or after 10 November 2008.  The disclaimer is usually
     necessary only for documents that revise or obsolete older RFCs, and that
     take significant amounts of text from those RFCs.  If you can contact all
     authors of the source material and they are willing to grant the BCP78
     rights to the IETF Trust, you can and should remove the disclaimer.
     Otherwise, the disclaimer is needed and you can ignore this comment.
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (August 14, 2011) is 4639 days in the past.  Is this
     intentional?

  Checking references for intended status: Best Current Practice
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Unused Reference: 'ASCII' is defined on line 341, but no explicit
     reference was found in the text

  == Unused Reference: 'ISO10646' is defined on line 345, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC3491' is defined on line 358, but no explicit
     reference was found in the text

  -- Possible downref: Non-RFC (?) normative reference: ref. 'ASCII'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'ISO10646'

  ** Obsolete normative reference: RFC 3490 (Obsoleted by RFC 5890, RFC 5891)

  ** Obsolete normative reference: RFC 3491 (Obsoleted by RFC 5891)

  -- Possible downref: Non-RFC (?) normative reference: ref. 'UNI9'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'UNIV6'

  -- Duplicate reference: RFC3987, mentioned in 'RFC3987', was also mentioned
     in 'RFC3987bis'.

     Summary: 2 errors (**), 0 flaws (~~), 6 warnings (==), 6 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

2  Internationalized Resource Identifiers                         M. Duerst
3  (iri)                                          Aoyama Gakuin University
4  Internet-Draft                                               L. Masinter
5  Intended status: BCP                                               Adobe
6  Expires: February 15, 2012                               August 14, 2011

8     Guidelines for Internationalized Resource Identifiers with Bi-
9                  directional Characters (Bidi IRIs)
10                  draft-ietf-iri-bidi-guidelines-00

12  Abstract

14     This specification gives guidelines for selection, use, and
15     presentation of Internationalized Resource Identifiers (IRIs) that
16     include characters with an inherent right-to-left (rtl) writing
       direction.

18  Status of this Memo

20     This Internet-Draft is submitted in full conformance with the
21     provisions of BCP 78 and BCP 79.

23     Internet-Drafts are working documents of the Internet Engineering
24     Task Force (IETF).  Note that other groups may also distribute
25     working documents as Internet-Drafts.  The list of current Internet-
26     Drafts is at http://datatracker.ietf.org/drafts/current/.

28     Internet-Drafts are draft documents valid for a maximum of six months
29     and may be updated, replaced, or obsoleted by other documents at any
30     time.  It is inappropriate to use Internet-Drafts as reference
31     material or to cite them other than as "work in progress."

33     This Internet-Draft will expire on February 15, 2012.

35  Copyright Notice

37     Copyright (c) 2011 IETF Trust and the persons identified as the
38     document authors.  All rights reserved.

40     This document is subject to BCP 78 and the IETF Trust's Legal
41     Provisions Relating to IETF Documents
42     (http://trustee.ietf.org/license-info) in effect on the date of
43     publication of this document.  Please review these documents
44     carefully, as they describe your rights and restrictions with respect
45     to this document.  Code Components extracted from this document must
46     include Simplified BSD License text as described in Section 4.e of
47     the Trust Legal Provisions and are provided without warranty as
48     described in the Simplified BSD License.

50     This document may contain material from IETF Documents or IETF
51     Contributions published or made publicly available before November
52     10, 2008.  The person(s) controlling the copyright in some of this
53     material may not have granted the IETF Trust the right to allow
54     modifications of such material outside the IETF Standards Process.
55     Without obtaining an adequate license from the person(s) controlling
56     the copyright in such materials, this document may not be modified
57     outside the IETF Standards Process, and derivative works of it may
58     not be created outside the IETF Standards Process, except to format
59     it for publication as an RFC or to translate it into languages other
60     than English.

62  Table of Contents

64     1.  Introduction
65       1.1.  Notation
66     2.  Logical Storage and Visual Presentation
67     3.  Bidi IRI Structure
68     4.  Input of Bidi IRIs
69     5.  Examples
70     6.  IANA Considerations
71     7.  Security Considerations
72     8.  Acknowledgements
73     9.  References
74       9.1.  Normative References
75       9.2.  Informative References
76     Authors' Addresses

78  1.  Introduction

80     Some UCS characters, such as those used in the Arabic and Hebrew
81     scripts, have an inherent right-to-left (rtl) writing direction.
82     IRIs containing these characters (called bidirectional IRIs or Bidi
83     IRIs) require additional attention because of the non-trivial
84     relation between logical representation (used for digital
85     representation and for reading/spelling) and visual representation
86     (used for display/printing).

88     Because of the complex interaction between the logical
89     representation, the visual representation, and the syntax of a Bidi
90     IRI, a balance is needed between various requirements.  The main
91     requirements are:

93     1.  user-predictable conversion between visual and logical
94         representation;

96     2.  the ability to include a wide range of characters in various parts
97         of the IRI; and

99     3.  minor or no changes or restrictions for implementations.

101  1.1.  Notation

103     In this document, Bidi Notation is used for bidirectional examples:
104     Lower case letters stand for Latin letters or other letters that are
105     written left to right, whereas upper case letters represent Arabic or
106     Hebrew letters that are written right to left.

108     In this document, the key words "MUST", "MUST NOT", "REQUIRED",
109     "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY",
110     and "OPTIONAL" are to be interpreted as described in [RFC2119].

112  2.  Logical Storage and Visual Presentation

114     When stored or transmitted in digital representation, bidirectional
115     IRIs MUST be in full logical order and MUST conform to the IRI syntax
116     rules (which include the rules relevant to their scheme).  This
117     ensures that bidirectional IRIs can be processed in the same way as
118     other IRIs.

120     Bidirectional IRIs MUST be rendered by using the Unicode
121     Bidirectional Algorithm [UNIV6], [UNI9].  Bidirectional IRIs MUST be
122     rendered in the same way as they would be if they were in a left-to-
123     right embedding; i.e., as if they were preceded by U+202A, LEFT-TO-
124     RIGHT EMBEDDING (LRE), and followed by U+202C, POP DIRECTIONAL
125     FORMATTING (PDF).  Setting the embedding direction can also be done
126     in a higher-level protocol (e.g., the dir='ltr' attribute in HTML).

128     There is no requirement to use the above embedding if the display is
129     still the same without the embedding.  For example, a bidirectional
130     IRI in a text with left-to-right base directionality (such as used
131     for English or Cyrillic) that is preceded and followed by whitespace
132     and strong left-to-right characters does not need an embedding.
133     Also, a bidirectional relative IRI reference that only contains
134     strong right-to-left characters and weak characters and that starts
135     and ends with a strong right-to-left character and appears in a text
136     with right-to-left base directionality (such as used for Arabic or
137     Hebrew) and is preceded and followed by whitespace and strong
138     characters does not need an embedding.

140     In some other cases, using U+200E, LEFT-TO-RIGHT MARK (LRM), may be
141     sufficient to force the correct display behavior.  However, the
142     details of the Unicode Bidirectional Algorithm are not always easy to
143     understand.  Implementers are strongly advised to err on the side of
144     caution and to use embedding in all cases where they are not
145     completely sure that the display behavior is unaffected without the
146     embedding.

148     The Unicode Bidirectional Algorithm ([UNI9], section 4.3) permits
149     higher-level protocols to influence bidirectional rendering.  Such
150     changes by higher-level protocols MUST NOT be used if they change the
151     rendering of IRIs.

153     The bidirectional formatting characters that may be used before or
154     after the IRI to ensure correct display are not themselves part of
155     the IRI.  IRIs MUST NOT contain bidirectional formatting characters
156     (LRM, RLM, LRE, RLE, LRO, RLO, and PDF).  They affect the visual
157     rendering of the IRI but do not appear themselves.  It would
158     therefore not be possible to input an IRI with such characters
159     correctly.

161  3.  Bidi IRI Structure

163     The Unicode Bidirectional Algorithm is designed mainly for running
164     text.  To make sure that it does not affect the rendering of
165     bidirectional IRIs too much, some restrictions on bidirectional IRIs
166     are necessary.  These restrictions are given in terms of delimiters
167     (structural characters, mostly punctuation such as "@", ".", ":", and
168     "/") and components (usually consisting mostly of letters and
169     digits).

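   The split into delimiters and components described above can be
   illustrated with a minimal sketch.  It is not part of this
   specification: the DELIMITERS set and the function names are
   assumptions chosen for illustration, and the real set of delimiters
   depends on the IRI syntax and on the scheme.

```python
import re

# Assumed delimiter set, for illustration only; the authoritative set
# comes from the IRI syntax and the rules of the scheme in use.
DELIMITERS = ":/?#@.=;&"


def split_iri(iri):
    """Return (kind, text) pairs, kind being 'delimiter' or 'component'."""
    # The capturing group makes re.split keep the delimiters in the result.
    pattern = "([" + re.escape(DELIMITERS) + "])"
    parts = []
    for piece in re.split(pattern, iri):
        if piece == "":
            continue  # adjacent delimiters yield empty strings
        is_delimiter = len(piece) == 1 and piece in DELIMITERS
        parts.append(("delimiter" if is_delimiter else "component", piece))
    return parts


def components(iri):
    """Return only the components, in logical order."""
    return [text for kind, text in split_iri(iri) if kind == "component"]


if __name__ == "__main__":
    # Bidi Notation: upper case stands for right-to-left letters.
    print(components("http://ab.CDEFGH.ij/kl/mn/op.html"))
```

   Treating "." as a delimiter everywhere also splits a resource name
   such as "op.html" into two components, which is one possible
   refinement of the component boundaries.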
171     The following syntax rules from the ABNF of [RFC3987bis] correspond
172     to components for the purpose of Bidi behavior: iuserinfo, ireg-name,
173     isegment, isegment-nz, isegment-nz-nc, iquery, and ifragment.

176     Specifications that define the syntax of any of the above components
177     MAY divide them further and define smaller parts to be components
178     according to this document.  As an example, the restrictions of
179     [RFC3490] on bidirectional domain names correspond to treating each
180     label of a domain name as a component for schemes with ireg-name as a
181     domain name.  Even where the components are not defined formally, it
182     may be helpful to think about some syntax in terms of components and
183     to apply the relevant restrictions.  For example, for the usual name/
184     value syntax in query parts, it is convenient to treat each name and
185     each value as a component.  As another example, the extensions in a
186     resource name can be treated as separate components.

188     For each component, the following restrictions apply:

190     1.  A component SHOULD NOT use both right-to-left and left-to-right
191         characters.

193     2.  A component using right-to-left characters SHOULD start and end
194         with right-to-left characters.

196     The above restrictions are given as "SHOULD"s, rather than as
197     "MUST"s.  For IRIs that are never presented visually, they are not
198     relevant.  However, for IRIs in general, they are very important to
199     ensure consistent conversion between visual presentation and logical
200     representation, in both directions.

202        Note: In some components, the above restrictions may actually be
203        strictly enforced.  For example, [RFC3490] requires that these
204        restrictions apply to the labels of a host name for those schemes
205        where ireg-name is a host name.  In some other components (for
206        example, path components) following these restrictions may not be
207        too difficult.  For other components, such as parts of the query
208        part, it may be very difficult to enforce the restrictions because
209        the values of query parameters may be arbitrary character
210        sequences.

212     If the above restrictions cannot be satisfied otherwise, the affected
213     component can always be mapped to URI notation using the general
214     percent-encoding of IRI components, as described in [RFC3987bis].
215     Please note that the whole component has to be mapped (see also
216     Example 9 below).

218  4.  Input of Bidi IRIs

220     Bidi input methods MUST generate Bidi IRIs in logical order while
221     rendering them according to Section 2.  During input, rendering
222     SHOULD be updated after every new character is input to avoid end-
223     user confusion.

225  5.  Examples

227     This section gives examples of bidirectional IRIs, in Bidi Notation.
228     It shows legal IRIs with the relationship between logical and visual
229     representation and explains how certain phenomena in this
230     relationship may look strange to somebody not familiar with
231     bidirectional behavior, but familiar to users of Arabic and Hebrew.
232     It also shows what happens if the restrictions given in Section 3 are
233     not followed.  The examples below can be seen at [BidiEx], in Arabic,
234     Hebrew, and Bidi Notation variants.

236     To read the bidi text in the examples, read the visual representation
237     from left to right until you encounter a block of rtl text.  Read the
238     rtl block (including slashes and other special characters) from right
239     to left, then continue at the next unread ltr character.

241     Example 1: A single component with rtl characters is inverted:
242     Logical representation: "http://ab.CDEFGH.ij/kl/mn/op.html"
243     Visual representation: "http://ab.HGFEDC.ij/kl/mn/op.html"
244     Components can be read one by one, and each component can be read in
245     its natural direction.

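   The visual representations in this section can be reproduced for
   Bidi Notation with a heavily simplified sketch of the Unicode
   Bidirectional Algorithm.  The sketch is illustrative only and is not
   a conforming implementation: it assumes a left-to-right embedding,
   maps upper case to R, lower case to L, and ASCII digits to European
   numbers, and applies only rules W4 to W7, N1, N2, and L2 of [UNI9].
   The function names are hypothetical.

```python
def bidi_class(ch):
    """Bidi class of a character in Bidi Notation (simplified)."""
    if ch.isupper():
        return "R"   # stands for an Arabic or Hebrew letter
    if ch.islower():
        return "L"
    if ch.isdigit():
        return "EN"  # European number
    if ch in "+-":
        return "ES"  # European separator
    if ch in "#$%":
        return "ET"  # European terminator
    if ch in ",.:/":
        return "CS"  # common separator
    return "ON"      # other neutral


def resolve_runs(types, kind, decide):
    """Replace each maximal run of `kind` by decide(previous, next)."""
    n, i = len(types), 0
    while i < n:
        if types[i] != kind:
            i += 1
            continue
        j = i
        while j < n and types[j] == kind:
            j += 1
        prev = types[i - 1] if i > 0 else "L"  # start of text counts as L
        nxt = types[j] if j < n else "L"       # end of text counts as L
        types[i:j] = [decide(prev, nxt)] * (j - i)
        i = j


def visual_order(logical):
    """Return the visual order of a Bidi Notation string (ltr embedding)."""
    types = [bidi_class(c) for c in logical]
    n = len(types)
    # W4: a single separator between two European numbers becomes a number.
    for i in range(1, n - 1):
        if types[i] in ("ES", "CS") and types[i - 1] == types[i + 1] == "EN":
            types[i] = "EN"
    # W5: terminators adjacent to a European number become numbers.
    resolve_runs(types, "ET",
                 lambda p, q: "EN" if "EN" in (p, q) else "ON")
    # W6: remaining separators become neutrals.
    types = ["ON" if t in ("ES", "CS") else t for t in types]
    # W7: a European number whose preceding strong character is L becomes L.
    strong = "L"
    for i, t in enumerate(types):
        if t in ("L", "R"):
            strong = t
        elif t == "EN" and strong == "L":
            types[i] = "L"
    # N1/N2: neutrals take the direction shared by both neighbours (numbers
    # count as R), otherwise the embedding direction (L).
    def direction(t):
        return "L" if t == "L" else "R"
    resolve_runs(types, "ON",
                 lambda p, q: direction(p) if direction(p) == direction(q) else "L")
    # Implicit levels for an even embedding level: L=0, R=1, EN=2.
    levels = [{"L": 0, "R": 1, "EN": 2}[t] for t in types]
    # L2: reverse runs from the highest level down to the lowest odd level.
    chars = list(logical)
    for level in (2, 1):
        i = 0
        while i < n:
            j = i
            while j < n and levels[j] >= level:
                j += 1
            chars[i:j] = chars[i:j][::-1]
            i = max(j, i + 1)
    return "".join(chars)


if __name__ == "__main__":
    print(visual_order("http://ab.CDEFGH.ij/kl/mn/op.html"))
```

   Under these assumptions the sketch yields the visual representation
   given for Example 1.  A real application needs a complete
   implementation of [UNI9] operating on actual Unicode character
   properties.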
247     Example 2: More than one consecutive component with rtl characters is
248     inverted as a whole:
249     Logical representation: "http://ab.CDE.FGH/ij/kl/mn/op.html"
250     Visual representation: "http://ab.HGF.EDC/ij/kl/mn/op.html"
251     A sequence of rtl components is read rtl, in the same way as a
252     sequence of rtl words is read rtl in a bidi text.

254     Example 3: All components of an IRI (except for the scheme) are rtl.
255     All rtl components are inverted overall:
256     Logical representation: "http://AB.CD.EF/GH/IJ/KL?MN=OP;QR=ST#UV"
257     Visual representation: "http://VU#TS=RQ;PO=NM?LK/JI/HG/FE.DC.BA"
258     The whole IRI (except the scheme) is read rtl.  Delimiters between
259     rtl components stay between the respective components; delimiters
260     between ltr and rtl components don't move.

262     Example 4: Each of several sequences of rtl components is inverted on
263     its own:
264     Logical representation: "http://AB.CD.ef/gh/IJ/KL.html"
265     Visual representation: "http://DC.BA.ef/gh/LK/JI.html"
266     Each sequence of rtl components is read rtl, in the same way as each
267     sequence of rtl words in an ltr text is read rtl.

269     Example 5: Example 2, applied to components of different kinds:
270     Logical representation: "http://ab.cd.EF/GH/ij/kl.html"
271     Visual representation: "http://ab.cd.HG/FE/ij/kl.html"
272     The inversion of the domain name label and the path component may be
273     unexpected, but it is consistent with other bidi behavior.  For
274     reassurance that the domain component really is "ab.cd.EF", it may be
275     helpful to read aloud the visual representation following the bidi
276     algorithm.  After "http://ab.cd." one reads the RTL block
277     "E-F-slash-G-H", which corresponds to the logical representation.

279     Example 6: Same as Example 5, with more rtl components:
280     Logical representation: "http://ab.CD.EF/GH/IJ/kl.html"
281     Visual representation: "http://ab.JI/HG/FE.DC/kl.html"
282     The inversion of the domain name labels and the path components may
283     be easier to identify because the delimiters also move.

285     Example 7: A single rtl component includes digits:
286     Logical representation: "http://ab.CDE123FGH.ij/kl/mn/op.html"
287     Visual representation: "http://ab.HGF123EDC.ij/kl/mn/op.html"
288     Numbers are written ltr in all cases but are treated as an additional
289     embedding inside a run of rtl characters.  This is completely
290     consistent with usual bidirectional text.

292     Example 8 (not allowed): Numbers are at the start or end of an rtl
293     component:
294     Logical representation: "http://ab.cd.ef/GH1/2IJ/KL.html"
295     Visual representation: "http://ab.cd.ef/LK/JI1/2HG.html"
296     The sequence "1/2" is interpreted by the bidi algorithm as a
297     fraction, fragmenting the components and leading to confusion.  There
298     are other characters that are interpreted in a special way close to
299     numbers; in particular, "+", "-", "#", "$", "%", ",", ".", and ":".

301     Example 9 (not allowed): The numbers in the previous example are
302     percent-encoded:
303     Logical representation: "http://ab.cd.ef/GH%31/%32IJ/KL.html"
304     Visual representation: "http://ab.cd.ef/LK/JI%32/%31HG.html"

306     Example 10 (allowed but not recommended):
307     Logical representation: "http://ab.CDEFGH.123/kl/mn/op.html"
308     Visual representation: "http://ab.123.HGFEDC/kl/mn/op.html"
309     Components consisting of only numbers are allowed (it would be rather
310     difficult to prohibit them), but these may interact with adjacent RTL
311     components in ways that are not easy to predict.

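   Whether an IRI in Bidi Notation follows the two restrictions of
   Section 3 can be checked component by component.  The following is a
   minimal sketch under stated assumptions, not a normative procedure:
   the delimiter set and the function names are hypothetical, upper case
   stands for right-to-left characters, and lower case stands for
   left-to-right characters.

```python
import re

# Assumed delimiter set, for illustration only.
DELIMITERS = ":/?#@.=;&"


def check_component(component):
    """Return the Section 3 restrictions that a component does not follow."""
    has_rtl = any(c.isupper() for c in component)
    has_ltr = any(c.islower() for c in component)
    problems = []
    # Restriction 1: do not mix right-to-left and left-to-right characters.
    if has_rtl and has_ltr:
        problems.append("uses both rtl and ltr characters")
    # Restriction 2: an rtl component starts and ends with rtl characters.
    if has_rtl and not (component[0].isupper() and component[-1].isupper()):
        problems.append("does not start and end with an rtl character")
    return problems


def check_iri(iri):
    """Map each offending component of an IRI to its list of problems."""
    pattern = "[" + re.escape(DELIMITERS) + "]"
    report = {}
    for component in re.split(pattern, iri):
        if component:  # skip empty strings between adjacent delimiters
            problems = check_component(component)
            if problems:
                report[component] = problems
    return report


if __name__ == "__main__":
    print(check_iri("http://ab.cd.ef/GH1/2IJ/KL.html"))     # Example 8
    print(check_iri("http://ab.CDEFGH.123/kl/mn/op.html"))  # Example 10
```

   With these assumptions, the components "GH1" and "2IJ" of Example 8
   are reported, whereas Example 10 produces an empty report, in line
   with its "allowed but not recommended" status.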
313     Example 11 (allowed but not recommended):
314     Logical representation: "http://ab.CDEFGH.123ij/kl/mn/op.html"
315     Visual representation: "http://ab.123.HGFEDCij/kl/mn/op.html"
316     Components consisting of numbers and left-to-right characters are
317     allowed, but these may interact with adjacent RTL components in ways
318     that are not easy to predict.

320  6.  IANA Considerations

322     This document makes no changes to IANA registries.

324  7.  Security Considerations

326     Confusion can occur with bidirectional IRIs if the restrictions in
327     Section 3 are not followed.  The same visual representation may be
328     interpreted as different logical representations, and vice versa.  It
329     is also very important that a correct Unicode bidirectional
330     implementation be used.

332  8.  Acknowledgements

334     This document was derived from [RFC3987] and [RFC3987bis] and the
335     acknowledgments of those documents apply.

337  9.  References

339  9.1.  Normative References

341     [ASCII]    American National Standards Institute, "Coded Character
342                Set -- 7-bit American Standard Code for Information
343                Interchange", ANSI X3.4, 1986.

345     [ISO10646]
346                International Organization for Standardization, "ISO/IEC
347                10646:2003: Information Technology - Universal Multiple-
348                Octet Coded Character Set (UCS)", ISO Standard 10646,
349                December 2003.

351     [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
352                Requirement Levels", BCP 14, RFC 2119, March 1997.

354     [RFC3490]  Faltstrom, P., Hoffman, P., and A. Costello,
355                "Internationalizing Domain Names in Applications (IDNA)",
356                RFC 3490, March 2003.

358     [RFC3491]  Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep
359                Profile for Internationalized Domain Names (IDN)",
360                RFC 3491, March 2003.

362     [RFC3987bis]
363                Duerst, M., Masinter, L., and M. Suignard,
364                "Internationalized Resource Identifiers (IRIs)",
365                August 2011.

368     [UNI9]     Davis, M., "The Bidirectional Algorithm", Unicode Standard
369                Annex #9, March 2004.

372     [UNIV6]    The Unicode Consortium, "The Unicode Standard, Version
373                6.0.0 (Mountain View, CA, The Unicode Consortium, 2011,
374                ISBN 978-1-936213-01-6)", October 2010.

376  9.2.  Informative References

378     [BidiEx]   "Examples of bidirectional IRIs".

381     [RFC3987]  Duerst, M. and M. Suignard, "Internationalized Resource
382                Identifiers (IRIs)", RFC 3987, January 2005.

384  Authors' Addresses

386     Martin Duerst
387     Aoyama Gakuin University
388     5-10-1 Fuchinobe
389     Sagamihara, Kanagawa  229-8558
390     Japan

392     Phone: +81 42 759 6329
393     Fax:   +81 42 759 6495
394     Email: duerst@it.aoyama.ac.jp
395     URI:   http://www.sw.it.aoyama.ac.jp/D%C3%BCrst/

396     Larry Masinter
397     Adobe
398     345 Park Ave
399     San Jose, CA  95110
400     U.S.A.

402     Phone: +1-408-536-3024
403     Email: masinter@adobe.com
404     URI:   http://larry.masinter.net