Network Working Group                                 H. Alvestrand, Ed.
Internet-Draft                                                    Google
Intended status: Standards Track                                 C. Karp
Expires: July 18, 2010                 Swedish Museum of Natural History
                                                        January 14, 2010

                     Right-to-left scripts for IDNA
                       draft-ietf-idnabis-bidi-07

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on July 18, 2010.

Copyright Notice

   Copyright (c) 2010 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents in effect on the date of
   publication of this document (http://trustee.ietf.org/license-info).
   Please review these documents carefully, as they describe your rights
   and restrictions with respect to this document.

Abstract

   The use of right-to-left scripts in internationalized domain names
   has presented several challenges. This memo proposes a new BIDI rule
   for IDNA labels, based on the encountered problems with some scripts,
   and some shortcomings in the 2003 IDNA BIDI criterion.

Table of Contents

   1. Introduction
     1.1. Purpose and applicability
     1.2. Background and history
     1.3. Structure of the rest of this document
     1.4. Terminology
   2. The BIDI Rule
   3. The requirement set for the BIDI rule
   4. Examples of issues found with RFC 3454
     4.1. Dhivehi
     4.2. Yiddish
     4.3. Strings with numbers
   5. Troublesome situations and guidelines
   6. Other issues in need of resolution
   7. Compatibility considerations
     7.1. Backwards compatibility considerations
     7.2. Forward compatibility considerations
   8. IANA Considerations
   9. Security Considerations
   10. Acknowledgements
   11. References
     11.1. Normative references
     11.2. Informative references
   Appendix A. Change log
     A.1. Changes from draft-alvestrand-00 to -01
     A.2. Changes from alvestrand-01 to -02
     A.3. Changes from alvestrand-02 to -03
     A.4. Changes from alvestrand-03 to -04
     A.5. Changes from draft-alvestrand-04 to draft-ietf -00
     A.6. Changes from idnabis -00 to -01
     A.7. Changes from idnabis -01 to -02
     A.8. Changes from idnabis -02 to -03
     A.9. Changes from idnabis -03 to -04
     A.10. Changes from idnabis -04 to -05
     A.11. Changes from idnabis -05 to -06
     A.12. Changes from idnabis -06 to -07
   Authors' Addresses

1. Introduction

1.1. Purpose and applicability

   The purpose of this document is to establish a rule that can be
   applied to Internationalized Domain Name (IDN) labels in Unicode form
   (U-labels) containing characters from scripts that are written from
   right to left. It is part of the revised IDNA protocol defined in
   [I-D.ietf-idnabis-protocol].

   When labels satisfy the rule, and when certain other conditions are
   satisfied, there is only a minimal chance of these labels being
   displayed in a confusing way by the Unicode bidirectional display
   algorithm.

   The other normative documents in the IDNA2008 document set establish
   criteria for valid labels, including listing the permitted
   characters.
   This document establishes additional validity criteria for labels in
   scripts normally written from right to left.

   This specification is not intended to place any requirements on
   domain names that do not contain characters from such scripts.

1.2. Background and history

   The "Stringprep" specification [RFC3454], part of IDNA2003, made the
   following statement in its section 6 on the BIDI algorithm:

      3) If a string contains any RandALCat character, a RandALCat
      character MUST be the first character of the string, and a
      RandALCat character MUST be the last character of the string.

   (A RandALCat character is a character with unambiguously right-to-
   left directionality.)

   The reasoning behind this prohibition was to ensure that every
   component of a displayed domain name has an unambiguously preferred
   direction. However, this made certain words in languages written
   with right-to-left scripts invalid as IDN labels, and in at least one
   case (Dhivehi) meant that all the words of an entire language were
   forbidden as IDN labels.

   This is illustrated below with examples taken from the Dhivehi and
   Yiddish languages, as written with the Thaana and Hebrew scripts,
   respectively.

   RFC 3454 did not explicitly state the requirement to be fulfilled.
   Therefore, it is impossible to determine whether a simple relaxation
   of the rule would continue to fulfil the requirement.

   While this document specifies rules quite different from those of
   RFC 3454, most reasonable labels that were allowed under RFC 3454
   will also be allowed under this specification, so the operational
   impact of using the new rule in the updated IDNA specification is
   limited. The most important examples of labels that are no longer
   permitted are labels that mix Arabic and European digits (AN and EN)
   inside an RTL label, and labels that use AN in an LTR label (see
   Section 1.4 for terminology).

1.3. Structure of the rest of this document

   Section 2 defines a rule, the "BIDI rule", which can be used on a
   domain name label to check how safe it is to use in a domain name of
   possibly mixed directionality. The primary initial use of this rule
   is as part of the IDNA2008 protocol [I-D.ietf-idnabis-protocol].

   Section 3 sets out the requirements for defining the BIDI rule.

   Section 4 gives detailed examples that serve as justification for the
   new rule.

   Section 5 to Section 9 describe various situations that can occur
   when dealing with domain names with characters of different
   directionality.

   Only Section 1.4 and Section 2 are normative.

1.4. Terminology

   The terminology used to describe IDNA concepts is defined in
   [I-D.ietf-idnabis-defs].

   The terminology used for the BIDI properties of Unicode characters is
   taken from the Unicode Standard [Unicode].

   The Unicode standard specifies a BIDI property for each character,
   which controls the character's behaviour in the Unicode bidirectional
   algorithm [UAX9]. For reference, here are the values that the
   Unicode BIDI property can have:

   o L - Left-to-right - most letters in LTR scripts

   o R - Right-to-left - most letters in non-Arabic RTL scripts

   o AL - Arabic letters - most letters in the Arabic script

   o EN - European Number (0-9, and Extended Arabic-Indic numbers)

   o ES - European Number Separator (+ and -)

   o ET - European Number Terminator (currency symbols, the hash sign,
     the percent sign and so on)

   o AN - Arabic Number; this encompasses the Arabic-Indic numbers, but
     not the Extended Arabic-Indic numbers

   o CS - Common Number Separator (. , / : et al)

   o NSM - Nonspacing Mark - most combining accents

   o BN - Boundary Neutral - control characters (ZWNJ, ZWJ and others)

   o B - Paragraph Separator

   o S - Segment Separator

   o WS - Whitespace, including the SPACE character

   o ON - Other Neutrals, including @, &, parentheses, MIDDLE DOT

   o LRE, LRO, RLE, RLO, PDF - these are "directional control
     characters", and are not used in IDNA labels.

   In this memo, we use "network order" to describe the sequence of
   characters as transmitted on the wire or stored in a file; the terms
   "first", "next", "previous", "beginning", "end", "before" and "after"
   are used to refer to the relationship of characters and labels in
   network order.

   We use "display order" to talk about the sequence of characters as
   imaged on a display medium; the terms "left" and "right" are used to
   refer to the relationship of characters and labels in display order.

   Most of the time, the examples use the abbreviations for the Unicode
   BIDI classes to denote the directionality of the characters; the
   example string CS L consists of one character of class CS and one
   character of class L. In some examples, the convention that uppercase
   characters are of class R or AL, and lowercase characters are of
   class L is used - thus, the example string ABC.abc would consist of 3
   right-to-left characters and 3 left-to-right characters.

   The directionality of such examples is determined by context - for
   instance, in the sentence "ABC.abc is displayed as CBA.abc", the
   first example string is in network order, the second example string
   is in display order.

   The term "paragraph" is used in the sense of the Unicode BIDI
   specification [UAX9] - it means "a block of text that has an overall
   direction, either left-to-right or right-to-left", approximately; see
   UAX 9 for the details.
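
   As an illustration of the BIDI property values listed above, the
   following minimal sketch looks up the property with Python's
   standard unicodedata module. The sample characters and the helper
   name bidi_class are chosen here for illustration only and are not
   part of this specification; the values returned depend on the
   Unicode version bundled with the Python interpreter.

```python
import unicodedata

# Sample characters (an illustrative selection, not from the draft) paired
# with the BIDI class that the list above assigns to them.
SAMPLES = {
    "a": "L",         # LATIN SMALL LETTER A
    "\u05d0": "R",    # HEBREW LETTER ALEF
    "\u0627": "AL",   # ARABIC LETTER ALEF
    "5": "EN",        # DIGIT FIVE
    "\u06f5": "EN",   # EXTENDED ARABIC-INDIC DIGIT FIVE
    "-": "ES",        # HYPHEN-MINUS
    "%": "ET",        # PERCENT SIGN
    "\u0665": "AN",   # ARABIC-INDIC DIGIT FIVE
    ".": "CS",        # FULL STOP
    "\u05b8": "NSM",  # HEBREW POINT QAMATS
    "\u200d": "BN",   # ZERO WIDTH JOINER
    " ": "WS",        # SPACE
    "@": "ON",        # COMMERCIAL AT
}


def bidi_class(ch: str) -> str:
    """Return the Unicode BIDI property value of a single character."""
    return unicodedata.bidirectional(ch)


def bidi_classes(text: str) -> list:
    """Return the BIDI property values of a string, in network order."""
    return [bidi_class(ch) for ch in text]


if __name__ == "__main__":
    for ch, expected in SAMPLES.items():
        print(f"U+{ord(ch):04X} {bidi_class(ch):>3} (listed as {expected})")
```

   Because the lookup works on code points in the order stored in the
   string, it operates on network order; it says nothing about display
   order.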

   "RTL" and "LTR" are abbreviations for "right to left" and "left to
   right", respectively.

   An RTL label is a label that contains at least one character of type
   R, AL or AN.

   An LTR label is any label that is not an RTL label.

   A "BIDI domain name" is a domain name that contains at least one RTL
   label. (Note: This definition includes domain names containing only
   dots and right-to-left characters. Providing a separate category of
   "RTL domain names" would not make this specification simpler, so has
   not been done.)

2. The BIDI Rule

   The following rule, consisting of six conditions, applies to labels
   in BIDI domain names. The requirements that this rule satisfies are
   described in Section 3. All the conditions must be satisfied for the
   rule to be satisfied.

   1. The first character must be a character with BIDI property L, R
      or AL. If it has the R or AL property, it is an RTL label; if it
      has the L property, it is an LTR label.

   2. In an RTL label, only characters with the BIDI properties R, AL,
      AN, EN, ES, CS, ET, ON, BN and NSM are allowed.

   3. In an RTL label, the end of the label must be a character with
      BIDI property R, AL, EN or AN, followed by zero or more
      characters with BIDI property NSM.

   4. In an RTL label, if an EN is present, no AN may be present, and
      vice versa.

   5. In an LTR label, only characters with the BIDI properties L, EN,
      ES, CS, ET, ON, BN and NSM are allowed.

   6. In an LTR label, the end of the label must be a character with
      BIDI property L or EN, followed by zero or more characters with
      BIDI property NSM.

   The following guarantees can be made based on the above:

   o In a domain name consisting of only labels that satisfy the rule,
     the requirements of Section 3 are satisfied. Note that even LTR
     labels and pure ASCII labels have to be tested.

   o In a domain name consisting of only LDH-labels and labels that
     satisfy the rule, the requirements of Section 3 are satisfied as
     long as a label that starts with an ASCII digit does not come
     after a right-to-left label.

   No guarantee is given for other combinations.

3. The requirement set for the BIDI rule

   This document, unlike RFC 3454, proposes an explicit justification
   for the BIDI rule, and states a set of requirements for which it is
   possible to test whether or not the modified rule fulfils the
   requirement.

   All the text in this document assumes that text containing the labels
   under consideration will be displayed using the Unicode bidirectional
   algorithm [UAX9].

   The requirements proposed are these:

   o Label Uniqueness: No two labels, when presented in display order
     in the same paragraph, should have the same sequence of characters
     without also having the same sequence of characters in network
     order, both when the paragraph has LTR direction and when the
     paragraph has RTL direction. (This is the criterion that is
     explicit in RFC 3454). (Note that a label displayed in an RTL
     paragraph may display the same as a different label displayed in
     an LTR paragraph, and still satisfy this criterion.)

   o Character Grouping: When displaying a string of labels, using the
     Unicode BIDI algorithm to reorder the characters for display, the
     characters of each label should remain grouped between the
     characters delimiting the labels, both when the string is embedded
     in a paragraph with LTR direction and when it is embedded in a
     paragraph with RTL direction.

   Several stronger statements were considered and rejected, because
   they seem to be impossible to fulfil within the constraints of the
   Unicode bidirectional algorithm. These include:

   o The appearance of a label should be unaffected by its embedding
     context. This proved impossible even for ASCII labels; the label
     "123-A" will have a different display order in an RTL context than
     in an LTR context. (This particular example is, however,
     disallowed anyway.)

   o The sequence of labels should be consistent with network order.
     This proved impossible - a domain name consisting of the labels
     (in network order) L1.R2.R3.L4 will be displayed as L1.R3.R2.L4 in
     an LTR context. (In an RTL context, it will be displayed as
     L4.R3.R2.L1).

   o No two domain names should be displayed the same, even under
     differing directionality. This was shown to be unsound, since the
     domain name (in network order) ABC.abc will have display order
     CBA.abc in an LTR context and abc.CBA in an RTL context, while the
     domain name (network) abc.ABC will have display order abc.CBA in
     an LTR context and CBA.abc in an RTL context.

   One possible requirement was thought to be problematic, but turned
   out to be satisfied by a string that obeys the proposed rules:

   o The Character Grouping requirement should be satisfied when
     directional controls (LRE, RLE, RLO, LRO, PDF) are used in the
     same paragraph (outside of the labels). Because these controls
     affect presentation order in non-obvious ways, by affecting the
     "sor" and "eor" properties of the Unicode BIDI algorithm, the
     conditions above require extra testing in order to figure out
     whether or not they influence the display of the domain name.
     Testing found that for the strings allowed under the rule
     presented in this document, directional controls do not influence
     the display of the domain name.

   This is still not stated as a requirement, since it did not seem as
   important as those stated, but it is useful to know that BIDI domain
   names where the labels satisfy the rule have this property.
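
   The rule referred to throughout this section is the six-condition
   BIDI rule of Section 2. The following is a minimal sketch of a
   per-label check, assuming that Python's unicodedata module supplies
   the BIDI property values. The function name satisfies_bidi_rule is
   hypothetical, and treating an empty label as failing condition 1 is
   an assumption of this sketch, not a statement of the rule. The
   sketch tests only the BIDI rule; it does not test any other
   IDNA2008 validity criterion.

```python
import unicodedata

# Conditions 2 and 5: BIDI properties allowed in RTL and LTR labels.
RTL_ALLOWED = {"R", "AL", "AN", "EN", "ES", "CS", "ET", "ON", "BN", "NSM"}
LTR_ALLOWED = {"L", "EN", "ES", "CS", "ET", "ON", "BN", "NSM"}


def satisfies_bidi_rule(label: str) -> bool:
    """Check one label (in network order) against the six conditions."""
    if not label:
        return False  # assumption: an empty label has no first character

    classes = [unicodedata.bidirectional(ch) for ch in label]

    # Condition 1: the first character decides RTL versus LTR.
    if classes[0] in ("R", "AL"):
        is_rtl = True
    elif classes[0] == "L":
        is_rtl = False
    else:
        return False

    # Conditions 3 and 6 look at the last character that is not a
    # trailing NSM. The first character is never NSM here, so one exists.
    end = len(classes)
    while classes[end - 1] == "NSM":
        end -= 1
    last = classes[end - 1]

    if is_rtl:
        if not set(classes) <= RTL_ALLOWED:        # condition 2
            return False
        if last not in ("R", "AL", "EN", "AN"):    # condition 3
            return False
        if "EN" in classes and "AN" in classes:    # condition 4
            return False
        return True

    if not set(classes) <= LTR_ALLOWED:            # condition 5
        return False
    return last in ("L", "EN")                     # condition 6
```

   For example, under this sketch the string HEBREW LETTER ALEF
   followed by DIGIT FIVE passes, while DIGIT FIVE followed by HEBREW
   LETTER ALEF fails condition 1.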

   In the following descriptions, first-level bullets are used to
   indicate rules or normative statements; second-level bullets are
   commentary.

   The Character Grouping requirement can be more formally stated as:

   o Let "Delimiterchars" be a set of characters with the Unicode BIDI
     properties CS, WS, ON. (These are commonly used to delimit labels
     - both the FULL STOP and the space are included. They are not
     allowed in domain labels.)

     * ET, though it commonly occurs next to domain names in practice,
       is problematic: the context R CS L EN ET (for instance A.a1%)
       makes the label L EN not satisfy the character grouping
       requirement.

     * ES commonly occurs in labels as HYPHEN-MINUS, but could also be
       used as a delimiter (for instance, the plus sign). It is left
       out here.

   o Let "unproblematic label" be a label that either satisfies the
     requirements, or does not contain any character with the BIDI
     properties R, AL or AN, and does not begin with a character with
     the BIDI property EN. (Informally, "it does not start with a
     number".)

   A label X satisfies the Character Grouping requirement when, for any
   two characters D1 and D2 taken from Delimiterchars, and for any S1
   and S2 that are each either an unproblematic label or an empty
   string, the following holds true:

   If the string formed by concatenating S1, D1, X, D2 and S2 is
   reordered according to the BIDI algorithm, then all the characters of
   X in the reordered string are between D1 and D2, and no other
   characters are between D1 and D2, both if the overall paragraph
   direction is LTR and if the overall paragraph direction is RTL.

   Note that the definition is self-referential, since S1 and S2 are
   constrained to be "legal" by this definition. This makes testing
   changes to proposed rules a little complex, but does not create
   problems for testing whether or not a given proposed rule satisfies
   the criterion.

   The "zero-length" case, in which S1 or S2 is an empty string,
   represents the case where a domain name is next to something that
   isn't a domain name, separated by a delimiter character.

   Note about the position of BN: The Unicode bidirectional algorithm
   specifies that BN characters have an effect on the adjoining
   characters in network order, not in display order, and are therefore
   treated as if removed during BIDI processing ([UAX9] section 3.3.2
   rule X9 and section 5.3). Therefore, the question of "what position
   does a BN have after reordering" is not meaningful. It has been
   ignored while developing the rules here.

   The Label Uniqueness requirement can be formally stated as:

   If two non-identical labels X and Y, embedded as for the test above,
   displayed in paragraphs with the same directionality, are reordered
   by the BIDI algorithm into the same sequence of codepoints, the
   labels X and Y cannot both be legal.

4. Examples of issues found with RFC 3454

4.1. Dhivehi

   Dhivehi, the official language of the Maldives, is written with the
   Thaana script. This script displays some of the characteristics of
   Arabic script, including its directional properties, and the
   indication of vowels by the diacritical marking of consonantal base
   characters. This marking is obligatory, and both two consecutive
   vowels and syllable-final consonants are indicated with unvoiced
   combining marks. Every Dhivehi word therefore ends with a combining
   mark.
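
   The consequence of a final combining mark under RFC 3454 can be
   sketched with Python's standard stringprep module, whose
   in_table_d1 function tests membership in table D.1 of RFC 3454 (the
   RandALCat characters). The function name passes_rfc3454_first_last
   is hypothetical, and the sketch checks only requirement 3 quoted in
   Section 1.2, not the rest of Stringprep. The sample string is the
   Dhivehi word for "computer" whose code points are listed in this
   section.

```python
import stringprep

# The Dhivehi word romanized as "konpeetaru", in network order.
KONPEETARU = "\u0786\u07ae\u0782\u07b0\u0795\u07a9\u0793\u07a6\u0783\u07aa"


def passes_rfc3454_first_last(s: str) -> bool:
    """Check only requirement 3 of RFC 3454, section 6.

    If the string contains any RandALCat character, both its first and
    its last character must be RandALCat characters.
    """
    if not any(stringprep.in_table_d1(ch) for ch in s):
        return True  # the requirement does not apply to this string
    return stringprep.in_table_d1(s[0]) and stringprep.in_table_d1(s[-1])


if __name__ == "__main__":
    # The word ends with U+07AA, a combining mark outside RandALCat.
    print(passes_rfc3454_first_last(KONPEETARU))
```

   The check fails for this word only because of its last code point;
   the same string with the final combining mark removed would pass,
   but would no longer spell the word.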

   The word for "computer", which is romanized as "konpeetaru", is
   written with the following sequence of Unicode code points:

      U+0786 THAANA LETTER KAAFU (AL)

      U+07AE THAANA OBOFILI (NSM)

      U+0782 THAANA LETTER NOONU (AL)

      U+07B0 THAANA SUKUN (NSM)

      U+0795 THAANA LETTER PAVIYANI (AL)

      U+07A9 THAANA EEBEEFILI (NSM)

      U+0793 THAANA LETTER TAVIYANI (AL)

      U+07A6 THAANA ABAFILI (NSM)

      U+0783 THAANA LETTER RAA (AL)

      U+07AA THAANA UBUFILI (NSM)

   The directionality class of U+07AA in the Unicode database [Unicode]
   is NSM (non-spacing mark), which is not R or AL; a conformant
   implementation of the IDNA2003 algorithm will say that "this is not
   in RandALCat", and refuse to encode the string.

4.2. Yiddish

   Yiddish is one of several languages written with the Hebrew script
   (others include Hebrew and Ladino). This is basically a consonantal
   alphabet (also termed an "abjad") but Yiddish is written using an
   extended form that is fully vocalic. The vowels are indicated in
   several ways, of which one is by repurposing letters that are
   consonants in Hebrew. Other letters are used both as vowels and
   consonants, with combining marks, called "points", used to
   differentiate between them. Finally, some base characters can
   indicate several different vowels, which are also disambiguated by
   combining marks. Pointed characters can appear in word-final
   position and may therefore also be needed at the end of labels. This
   is not an invariable attribute of a Yiddish string and there is thus
   greater latitude here than there is with Dhivehi.

   The organization now known as the "YIVO Institute for Jewish
   Research" developed orthographic rules for modern Standard Yiddish
   during the 1930s on the basis of work conducted in several venues
   since earlier in that century.
   These are given in "The Standardized Yiddish Orthography: Rules of
   Yiddish Spelling" [SYO], and are taken as normatively descriptive of
   modern Standard Yiddish in any context where that notion is deemed
   relevant. They have been applied exclusively in all formal Yiddish
   dictionaries published since their establishment, and are similarly
   dominant in academic and bibliographic regards.

   It therefore appears appropriate for this repertoire also to be
   supported fully by IDNA. This presents no difficulty with characters
   in initial and medial positions, but pointed characters are regularly
   used in final position as well. All of the characters in the SYO
   repertoire appear in both marked and unmarked form with one
   exception: the HEBREW LETTER PE (U+05E4). The SYO only permits this
   with a HEBREW POINT DAGESH (U+05BC), providing the Yiddish equivalent
   to the Latin letter "p", or a HEBREW POINT RAFE (U+05BF), equivalent
   to the Latin letter "f". There is, however, a separate unpointed
   allograph, the HEBREW LETTER FINAL PE (U+05E3), for the latter
   character when it appears in final position. The constraint on the
   use of the SYO repertoire resulting from the proscription of
   combining marks at the end of RTL strings thus reduces to nothing
   more, or less, than the equivalent of saying that a string of Latin
   characters cannot end with the letter "p". It must also be noted
   that the HEBREW LETTER PE with HEBREW POINT DAGESH is characteristic
   of almost all traditional Yiddish orthographies that predate (or
   remain in use in parallel to) the SYO, being the first pointed
   character to appear in any of them.

   A more general instantiation of the basic problem can be seen in the
   representation of the YIVO acronym. This acronym is written with the
   Hebrew letters YOD YOD HIRIQ VAV VAV ALEF QAMATS, where HIRIQ and
   QAMATS are combining points.
   The Unicode codepoints are:

      U+05D9 HEBREW LETTER YOD (R)

      U+05B4 HEBREW POINT HIRIQ (NSM)

      U+05D5 HEBREW LETTER VAV (R)

      U+05D0 HEBREW LETTER ALEF (R)

      U+05B8 HEBREW POINT QAMATS (NSM)

   The directionality class of U+05B8 HEBREW POINT QAMATS in the Unicode
   database is NSM, which again causes the IDNA2003 algorithm to reject
   the string.

   It may also be noted that all of the combined characters mentioned
   above exist in precomposed form at separate positions in the Unicode
   chart. However, by invoking Stringprep, the IDNA2003 algorithm also
   rejects those codepoints, for reasons not discussed here.

4.3. Strings with numbers

   By requiring that both the first and the last character of a string
   containing right-to-left characters be of category R or AL, RFC 3454
   prohibited such a string from ending with a number.

   Consider the strings ALEF 5 (HEBREW LETTER ALEF + DIGIT FIVE) and 5
   ALEF. Displayed in an LTR context, the first one will be displayed
   from left to right as 5 ALEF (with the 5 being considered right-to-
   left because of the leading ALEF), while 5 ALEF will be displayed in
   exactly the same order (5 taking the direction from context).
   Clearly, only one of those should be permitted as a registered label,
   but barring them both seems unnecessary.

5. Troublesome situations and guidelines

   There are situations in which labels that satisfy the rule above will
   be displayed in a surprising fashion. The most important of these is
   the case where a label ending in a character with BIDI property AL,
   AN or R occurs before a label beginning with a character of BIDI
   property EN. In that case, the number will appear to move into the
   label containing the right-to-left character, violating the Character
   Grouping requirement.
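
   An application that wishes to detect the situation just described
   could scan adjacent labels in network order. The following is a
   minimal sketch of such a scan, assuming that labels are separated
   by FULL STOP and that Python's unicodedata module supplies the BIDI
   property values. The function name risky_label_pairs is
   hypothetical. Skipping trailing NSM characters when finding the end
   of a label is an assumption of this sketch that goes beyond the
   text above; it is made so that labels ending in a combining mark,
   such as Dhivehi words, are also examined.

```python
import unicodedata


def risky_label_pairs(domain: str) -> list:
    """Return (label, next_label) pairs, in network order, where a label
    ending in AL, AN or R comes directly before a label starting with EN.
    """
    labels = domain.split(".")
    risky = []
    for left, right in zip(labels, labels[1:]):
        if not left or not right:
            continue  # empty label, e.g. from a trailing dot

        # Assumption of this sketch: ignore trailing NSM characters.
        end = len(left)
        while end > 0 and unicodedata.bidirectional(left[end - 1]) == "NSM":
            end -= 1
        if end == 0:
            continue  # the label consists of NSM characters only

        ends_rtl = unicodedata.bidirectional(left[end - 1]) in ("AL", "AN", "R")
        starts_en = unicodedata.bidirectional(right[0]) == "EN"
        if ends_rtl and starts_en:
            risky.append((left, right))
    return risky
```

   A registry could run a check of this kind before accepting an RTL
   label below a parent whose name begins with a digit; whether to
   warn or to refuse is left to the caller.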

   If the label that occurs after the right-to-left label itself
   satisfies the BIDI criterion, the requirements will be satisfied in
   all cases (this is the reason why the criterion talks about strings
   containing L in some cases). However, the WG concluded that this
   could not be required for several reasons:

   o There is a large current deployment of ASCII domain names starting
     with digits. These cannot possibly be invalidated.

   o Domain names are often constructed piecemeal, for instance by
     combining a string with the content of a search list. This may
     occur after IDNA processing, and thus in part of the code that is
     not IDNA-aware, making detection of the undesirable combination
     impossible.

   o Even if a label is registered under a "safe" label, there may be a
     DNAME [RFC2672] with an "unsafe" label that points to the "safe"
     label, thus creating seemingly-valid names that would not satisfy
     the criterion.

   o Wildcards create the odd situation where a label is "valid" (can
     be looked up successfully) without the zone owner knowing that
     this label exists. So an owner of a zone whose name starts with a
     digit and contains a wildcard has no way of controlling whether or
     not names with RTL labels in them are looked up in his zone.

   Rather than trying to suggest rules that disallow all such
   undesirable situations, this document merely warns about the
   possibility, and leaves it to application developers to take whatever
   measures they deem appropriate to avoid problematic situations.

6. Other issues in need of resolution

   This document concerns itself only with the rules that are needed
   when dealing with domain names with characters that have differing
   BIDI properties, and considers characters only in terms of their BIDI
   properties. All other issues with scripts that are written from
   right to left must be considered in other contexts.
600 One such issue is the need to keep numbers separate. Several scripts 601 are used with multiple sets of numbers - most commonly they use Latin 602 numbers and a script-specific set of numbers, but in the case of 603 Arabic, there are 2 sets of "Arabic-Indic" digits involved. 605 The algorithm in this document disallows occurrences of AN-class 606 characters ("Arabic-Indic digits", U+0660 to U+0669) together with 607 EN-class characters (which includes "European" digits, U+0030 to 608 U+0039 and "extended Arabic-Indic digits", U+06F0 to U+06F9), but 609 does not help in preventing the mixing of, for instance, Bengali 610 digits (U+09E6 to U+09EF) and Gujarati digits (U+0AE6 to U+0AEF), 611 both of which have BIDI class L. A registry or script community that 612 wishes to create rules restricting the mixing of digits in a label 613 will be able to specify these restrictions at the registry level. 614 Some rules are also specified at the protocol level. 616 Another set of issues concerns the proper display of IDNs with a 617 mixture of LTR and RTL labels, or only RTL labels. 619 It is unrealistic to expect that applications will display domain 620 names using embedded formatting codes between their labels (for one 621 thing, no reliable algorithms for identifying domain names in running 622 text exist); thus, the display order will be determined by the BIDI 623 algorithm. Thus, a sequence (in network order) of R1.R2.ltr will be 624 displayed in the order 2R.1R.ltr in an LTR context, which might 625 surprise someone expecting to see labels displayed in hierarchical 626 order. People used to working with text that mixes LTR and RTL 627 strings might not be so surprised by this. Again, this memo does not 628 attempt to suggest a solution to this problem. 630 7. Compatibility considerations 632 7.1. 
Backwards compatibility considerations

634     As with any change to an existing standard, it is important to
635     consider what happens with existing implementations when the change
636     is introduced.  Some troublesome cases include:

638     o  Old program used to input the newly-allowed label.  If the old
639        program checks the input against RFC 3454, some labels will not be
640        allowed, and domain names containing those labels will remain
641        inaccessible.

643     o  Old program is asked to display the newly-allowed label, and
644        checks it against RFC 3454 before displaying.  The program will
645        perform some kind of fallback, most likely displaying the label in
646        A-label form.

648     o  Old program tries to display the newly-allowed label.  If the old
649        program has code for displaying the last character of a label that
650        is different from the code used to display the characters in the
651        middle of the label, the display may be inconsistent and cause
652        confusion.

654     One particular example of the last case is if a program chooses to
655     examine the last character (in network order) of a string in order to
656     determine its directionality, rather than its first.  If it finds an
657     NSM character and tries to display the string as if it were a left-to-
658     right string, the resulting display may be interesting, but not
659     useful.

661     The editors believe that these cases will have a less harmful impact in
662     practice than continuing to deny the use of words from the languages
663     for which these strings are necessary as IDN labels.

665     This specification does not forbid using leading European digits in
666     ASCII-only labels, since this would conflict with a large installed
667     base of such labels, and would increase the scope of the
668     specification from RTL labels to all labels.  The harm resulting from
669     this limitation of scope is described in Section 5.
Registries and
670     private zone managers can check for this particular condition before
671     they allow registration of any RTL label.  Generally it is best to
672     disallow registration of any right-to-left strings in a zone where
673     the label at the level above begins with a digit.

675  7.2.  Forward compatibility considerations

677     This text is intentionally specified strictly in terms of the Unicode
678     BIDI properties.  The determination that the condition is sufficient
679     to fulfil the criteria depends on the Unicode BIDI algorithm; it is
680     unlikely that drastic changes will be made to this algorithm.

682     However, the determination of validity for any string depends on the
683     Unicode BIDI property values, which are not declared immutable by the
684     Unicode Consortium.  Furthermore, the behaviour of the algorithm for
685     any given character is likely to be linguistically and culturally
686     sensitive, so that while it should occur rarely, it is possible that
687     later versions of the Unicode standard may change the BIDI properties
688     assigned to certain Unicode characters.

690     This memo does not propose a solution for this problem.

692  8.  IANA Considerations

694     This document makes no request of IANA.

696     Note to RFC Editor: this section may be removed on publication as an
697     RFC.

699  9.  Security Considerations

701     The display behaviour of mixed-direction text can be extremely
702     surprising to users who are not used to it; for instance, cut and
703     paste of a piece of text can cause the text to display differently at
704     the destination, if the destination is in another directionality
705     context, and adding a character in one place of a text can cause
706     characters some distance from the point of insertion to change their
707     display position.  This is, however, not a phenomenon unique to the
708     display of domain names.
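
   One precaution that a registry or private zone manager might take
   against the label-adjacency problem discussed in Sections 5 and 7.1
   is sketched below.  This is an illustration only, not a normative
   procedure.  It assumes that labels are supplied as Unicode strings
   (U-labels), that the zone name is in dotted presentation form, that
   an RTL label is one containing at least one character of BIDI class
   R, AL or AN (as the terminology of this document is understood to
   define it), and that "begins with a digit" means an ASCII digit.  The
   function names are hypothetical.

```python
import unicodedata

# BIDI classes assumed to make a label an RTL label (at least one
# character of type R, AL, or AN).
RTL_CLASSES = {"R", "AL", "AN"}


def is_rtl_label(label):
    """True if the U-label contains at least one R, AL, or AN character."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in label)


def registration_allowed(new_label, zone_name):
    """Hypothetical registry policy check.

    Refuses an RTL label directly below a label that begins with an
    ASCII digit; every other combination is accepted by this check.
    """
    # The label "at the level above" is the leftmost label of the zone.
    parent_label = zone_name.split(".")[0]
    parent_starts_with_digit = parent_label[:1] in set("0123456789")
    return not (is_rtl_label(new_label) and parent_starts_with_digit)
```

   With these assumptions, registration_allowed("\u05d0\u05d1",
   "3com.example") is refused, while the same label under "example.com"
   and an ASCII label under "3com.example" are both accepted.  Such a
   check covers only names registered directly in the zone; it cannot
   see names reached through wildcards, DNAME records, or search lists.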
710     The new IDNA protocol, and particularly these new BIDI rules, will
711     allow some strings to be used in IDNA contexts that are not allowed
712     today.  It is possible that differences in the interpretation of
713     labels between implementations of IDNA2003 and IDNA2008 could pose a
714     security risk, but it is difficult to envision any specific
715     instantiation of this.

717     Any rational attempt to compute, for instance, a hash over an
718     identifier processed by IDNA would use network order for its
719     computation, and thus be unaffected by the new rules proposed here.

721     While it is not believed to pose a problem, if display routines had
722     been written with specific knowledge of the RFC 3454 IDNA
723     prohibitions, it is possible that the potential problems noted under
724     "backwards compatibility" could cause new kinds of confusion.

726  10.  Acknowledgements

728     While the listed editors held the pen, this document represents the
729     joint work and conclusions of an ad hoc design team.  In addition to
730     the editors this consisted of, in alphabetic order, Tina Dam, Patrik
731     Faltstrom, and John Klensin.  Many further specific contributions and
732     helpful comments were received from the people listed below, and
733     others who have contributed to the development and use of the IDNA
734     protocols.

736     The particular formulation of the BIDI rule in section 2 was
737     suggested by Matitiahu Allouche.
739     The team wishes in particular to thank Roozbeh Pournader for calling
740     its attention to the issue with the Thaana script, Paul Hoffman for
741     pointing out the need to be explicit about backwards compatibility
742     considerations, Ken Whistler for suggesting the basis of the
743     formalized "character grouping" requirement, Mark Davis for
744     commentary, Erik van der Poel for careful review, comments and
745     verification of the rulesets, Marcos Sanz, Andrew Sullivan and Pete
746     Resnick for reviews, and Vint Cerf for chairing the working group and
747     contributing massively to getting the documents finished.

749  11.  References

750  11.1.  Normative references

752     [I-D.ietf-idnabis-defs]
753                Klensin, J., "Internationalized Domain Names for
754                Applications (IDNA): Definitions and Document Framework",
755                draft-ietf-idnabis-defs-12 (work in progress),
756                October 2009.

758     [UAX9]     Davis, M., "Unicode Standard Annex #9: The Bidirectional
759                Algorithm, revision 19", March 2008.

761     [Unicode]  Unicode, "The Unicode Standard - version 5.2", 2008.

763  11.2.  Informative references

765     [I-D.ietf-idnabis-protocol]
766                Klensin, J., "Internationalized Domain Names in
767                Applications (IDNA): Protocol",
768                draft-ietf-idnabis-protocol-17 (work in progress),
769                October 2009.

771     [RFC2672]  Crawford, M., "Non-Terminal DNS Name Redirection",
772                RFC 2672, August 1999.

774     [RFC3454]  Hoffman, P. and M. Blanchet, "Preparation of
775                Internationalized Strings ("stringprep")", RFC 3454,
776                December 2002.

778     [SYO]      "The Standardized Yiddish Orthography: Rules of Yiddish
779                Spelling", 6th ed., New York, ISBN 0-914512-25-0,
780                1999.

782  Appendix A.  Change log

784     This appendix is intended to be removed by the RFC Editor when this
785     document is published as an RFC.

787  A.1.  Changes from draft-alvestrand-00 to -01

789     Suggested a possible new algorithm.

791     Multiple smaller changes.

793  A.2.  Changes from alvestrand-01 to -02

795     Date of publication updated.

797     Change log added.
799  A.3.  Changes from alvestrand-02 to -03

801     Intro changed to reflect addressing the deeper issues with the BIDI
802     algorithm.

804     Gave formalized criteria for "valid strings", and documented the new
805     set of requirements for strings that satisfy the criteria.

807     Removed most of section 5, "Other problems", and noted that this memo
808     focuses ONLY on issues that can be evaluated by looking at the BIDI
809     properties of characters.

811  A.4.  Changes from alvestrand-03 to -04

813     Added back AN to the list of allowed characters; it had been left out
814     by accident in -03.

816     Removed some rules that were redundant.

818     Added some considerations for backwards compatibility and interaction
819     with ASCII labels that start with a number.

821     Mentioned the issue with DNAME pointing to a zone containing RTL
822     labels in the security considerations section.

824     Wording updates in multiple places, including some spelling errors.

826     Rewrote the introduction section.

828     Split references into "normative" and "informative".

830  A.5.  Changes from draft-alvestrand-04 to draft-ietf -00

832     Changed name of draft.

834     Added a couple of "note in draft" statements to remind the WG of open
835     issues.

837     Noted that BIDI controls in the paragraph are unproblematic with the
838     given ruleset.

840  A.6.  Changes from idnabis -00 to -01

842     Added text to section 5 describing issues with mixture of numbers in
843     labels

844     Addressed some of the issues raised by Mark Davis in March 2008 in
845     regard to document clarity.

847     Changed the formulation of the label uniqueness requirement to be
848     consistent with the text under "Labels with numbers".

850     Spell-checked document.

852  A.7.  Changes from idnabis -01 to -02

854     Changed the domain of applicability to be only labels containing RTL
855     characters, described the conditions under which harm may result from
856     putting RTL labels next to other labels, and how to detect them.
858     A number of clarification and formatting changes in response to
859     reviews.

861  A.8.  Changes from idnabis -02 to -03

863     Rearranged section list so that the normative material is collected
864     at the front.

866     Moved list of BIDI properties into "terminology"

868     Clarified that only terminology and the BIDI rule is normative

870     Changed reference to point to -defs for definitions instead of
871     -rationale

873     Minor fixes in response to comments, wording cleanups, removed all
874     tentative language.

876  A.9.  Changes from idnabis -03 to -04

878     Updated to new IPR rules.

880     Minor textual clarifications.

882     Replaced the BIDI test with a version suggested by Matitiahu Allouche
883     - this description is simpler to understand than the one in -03, and
884     generates a larger set of allowable strings, while all tests indicate
885     that they still pass all the criteria.

887  A.10.  Changes from idnabis -04 to -05

889     Minor textual clarifications resulting from WG Last Call.  No
890     technical changes.

892     Updated UAX9 reference to Unicode 5.1 version.

894     Made better use of some terminology, and clarified the relationship
895     with RFC 3454 based on input from Paul Hoffman.

897     Added examples of newly-forbidden labels, based on advice from Andrew
898     Sullivan

900  A.11.  Changes from idnabis -05 to -06

902     Most of these changes are based on a review by Martin Duerst.

904     Rewrote abstract.

906     Changed "test" to "rule" throughout, with accompanying minor tweaks

908     Re-allowed BN in LTR labels (error introduced in -04).

910     Added words to explain role of BN more (in the requirements section).

912     Modified the words about the effect of BIDI changes after having
913     reassurance that changes are likely to be rare.

915     Minor textual fixes.

917  A.12.  Changes from idnabis -06 to -07

919     Added a note in the intro saying explicitly that other parts of
920     IDNABIS specify which characters are legal (in response to a Last
921     Call comment from Joel Halpern).
923     Inserted an explicit pointer to Dhivehi and a couple of other
924     clarifying changes to the (non-normative) section 4.

926     Mentioned Vint Cerf in the acknowledgements.

928  Authors' Addresses

930     Harald Tveit Alvestrand (editor)
931     Google
932     Beddingen 10
933     Trondheim,   7014
934     Norway

936     Email: harald@alvestrand.no

937     Cary Karp
938     Swedish Museum of Natural History
939     Frescativ. 40
940     Stockholm,   10405
941     Sweden

943     Phone: +46 8 5195 4055
945     Email: ck@nrm.museum