idnits 2.17.1 

draft-ietf-idnabis-bidi-07.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** The document seems to lack a License Notice according IETF Trust
     Provisions of 28 Dec 2009, Section 6.b.i or Provisions of 12 Sep 2009
     Section 6.b -- however, there's a paragraph with a matching beginning.
     Boilerplate error?

     (You're using the IETF Trust Provisions' Section 6.b License Notice from
     12 Feb 2009 rather than one of the newer Notices.  See
     https://trustee.ietf.org/license-info/.)


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  == There are 2 instances of lines with non-RFC2606-compliant FQDNs in the
     document.

  ** The document seems to lack a both a reference to RFC 2119 and the
     recommended RFC 2119 boilerplate, even if it appears to use RFC 2119
     keywords. 

     RFC 2119 keyword, line 120: '...      character MUST be the first char...'
     RFC 2119 keyword, line 121: '...dALCat character MUST be the last char...'


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (January 14, 2010) is 5215 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Outdated reference: A later version (-13) exists of
     draft-ietf-idnabis-defs-12

  -- Possible downref: Non-RFC (?) normative reference: ref. 'UAX9'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'Unicode'

  == Outdated reference: A later version (-18) exists of
     draft-ietf-idnabis-protocol-17

  -- Obsolete informational reference (is this intentional?): RFC 2672
     (Obsoleted by RFC 6672)

  -- Obsolete informational reference (is this intentional?): RFC 3454
     (Obsoleted by RFC 7564)


     Summary: 2 errors (**), 0 flaws (~~), 4 warnings (==), 5 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Network Working Group                                 H. Alvestrand, Ed.
3	Internet-Draft                                                    Google
4	Intended status: Standards Track                                 C. Karp
5	Expires: July 18, 2010                 Swedish Museum of Natural History
6	                                                        January 14, 2010

8	                     Right-to-left scripts for IDNA
9	                       draft-ietf-idnabis-bidi-07

11	Status of this Memo

13	   This Internet-Draft is submitted to IETF in full conformance with the
14	   provisions of BCP 78 and BCP 79.

16	   Internet-Drafts are working documents of the Internet Engineering
17	   Task Force (IETF), its areas, and its working groups.  Note that
18	   other groups may also distribute working documents as Internet-
19	   Drafts.

21	   Internet-Drafts are draft documents valid for a maximum of six months
22	   and may be updated, replaced, or obsoleted by other documents at any
23	   time.  It is inappropriate to use Internet-Drafts as reference
24	   material or to cite them other than as "work in progress."

26	   The list of current Internet-Drafts can be accessed at
27	   http://www.ietf.org/ietf/1id-abstracts.txt.

29	   The list of Internet-Draft Shadow Directories can be accessed at
30	   http://www.ietf.org/shadow.html.

32	   This Internet-Draft will expire on July 18, 2010.

34	Copyright Notice

36	   Copyright (c) 2010 IETF Trust and the persons identified as the
37	   document authors.  All rights reserved.

39	   This document is subject to BCP 78 and the IETF Trust's Legal
40	   Provisions Relating to IETF Documents in effect on the date of
41	   publication of this document (http://trustee.ietf.org/license-info).
42	   Please review these documents carefully, as they describe your rights
43	   and restrictions with respect to this document.

45	Abstract

47	   The use of right-to-left scripts in internationalized domain names
48	   has presented several challenges.  This memo proposes a new BIDI rule
49	   for IDNA labels, based on the encountered problems with some scripts,
50	   and some shortcomings in the 2003 IDNA BIDI criterion.

52	Table of Contents

54	   1.  Introduction
55	     1.1.  Purpose and applicability
56	     1.2.  Background and history
57	     1.3.  Structure of the rest of this document
58	     1.4.  Terminology
59	   2.  The BIDI Rule
60	   3.  The requirement set for the BIDI rule
61	   4.  Examples of issues found with RFC 3454
62	     4.1.  Dhivehi
63	     4.2.  Yiddish
64	     4.3.  Strings with numbers
65	   5.  Troublesome situations and guidelines
66	   6.  Other issues in need of resolution
67	   7.  Compatibility considerations
68	     7.1.  Backwards compatibility considerations
69	     7.2.  Forward compatibility considerations
70	   8.  IANA Considerations
71	   9.  Security Considerations
72	   10. Acknowledgements
73	   11. References
74	     11.1. Normative references
75	     11.2. Informative references
76	   Appendix A.  Change log
77	     A.1.  Changes from draft-alvestrand-00 to -01
78	     A.2.  Changes from alvestrand-01 to -02
79	     A.3.  Changes from alvestrand-02 to -03
80	     A.4.  Changes from alvestrand-03 to -04
81	     A.5.  Changes from draft-alvestrand-04 to draft-ietf -00
82	     A.6.  Changes from idnabis -00 to -01
83	     A.7.  Changes from idnabis -01 to -02
84	     A.8.  Changes from idnabis -02 to -03
85	     A.9.  Changes from idnabis -03 to -04
86	     A.10. Changes from idnabis -04 to -05
87	     A.11. Changes from idnabis -05 to -06
88	     A.12. Changes from idnabis -06 to -07
89	   Authors' Addresses

91	1.  Introduction

93	1.1.  Purpose and applicability

95	   The purpose of this document is to establish a rule that can be
96	   applied to Internationalized Domain Name (IDN) labels in Unicode form
97	   (U-labels) containing characters from scripts that are written from
98	   right to left.  It is part of the revised IDNA protocol defined in
99	   [I-D.ietf-idnabis-protocol].

101	   When labels satisfy the rule, and when certain other conditions are
102	   satisfied, there is only a minimal chance of these labels being
103	   displayed in a confusing way by the Unicode bidirectional display
104	   algorithm.

106	   The other normative documents in the IDNA2008 document set establish
107	   criteria for valid labels, including listing the permitted
108	   characters.  This document establishes additional validity criteria
109	   for labels in scripts normally written from right to left.

111	   This specification is not intended to place any requirements on
112	   domain names that do not contain characters from such scripts.

114	1.2.  Background and history

116	   The "Stringprep" specification [RFC3454], part of IDNA2003, made the
117	   following statement in its section 6 on the BIDI algorithm:

119	      3) If a string contains any RandALCat character, a RandALCat
120	      character MUST be the first character of the string, and a
121	      RandALCat character MUST be the last character of the string.

123	   (A RandALCat character is a character with unambiguously right-to-
124	   left directionality.)

126	   The reasoning behind this prohibition was to ensure that every
127	   component of a displayed domain name has an unambiguously preferred
128	   direction.  However, this made certain words in languages written
129	   with right-to-left scripts invalid as IDN labels, and in at least one
130	   case (Dhivehi) meant that all the words of an entire language were
131	   forbidden as IDN labels.

133	   This is illustrated below with examples taken from the Dhivehi and
134	   Yiddish languages, as written with the Thaana and Hebrew scripts,
135	   respectively.

137	   RFC 3454 did not explicitly state the requirement to be fulfilled.
138	   Therefore, it is impossible to determine whether a simple relaxation
139	   of the rule would continue to fulfil the requirement.

141	   While this document specifies rules quite different from RFC 3454,
142	   most reasonable labels that were allowed under RFC 3454 will also be
143	   allowed under this specification, so the operational impact of using
144	   the new rule in the updated IDNA specification is limited.  The most
145	   important labels that are no longer permitted are those that mix
146	   Arabic and European digits (AN and EN) inside an RTL label, and those
147	   that use AN in an LTR label (see Section 1.4 for terminology).

149	1.3.  Structure of the rest of this document

151	   Section 2 defines a rule, the "BIDI rule", which can be used on a
152	   domain name label to check how safe it is to use in a domain name of
153	   possibly mixed directionality.  The primary initial use of this rule
154	   is as part of the IDNA2008 protocol [I-D.ietf-idnabis-protocol].

156	   Section 3 sets out the requirements for defining the BIDI rule.

158	   Section 4 gives detailed examples that serve as justification for the
159	   new rule.

161	   Section 5 to Section 9 describe various situations that can occur
162	   when dealing with domain names with characters of different
163	   directionality.

165	   Only Section 1.4 and Section 2 are normative.

167	1.4.  Terminology

169	   The terminology used to describe IDNA concepts is defined in
170	   [I-D.ietf-idnabis-defs].

172	   The terminology used for the BIDI properties of Unicode characters is
173	   taken from the Unicode Standard [Unicode].

175	   The Unicode standard specifies a BIDI property for each character,
176	   which controls the character's behaviour in the Unicode bidirectional
177	   algorithm [UAX9].  For reference, here are the values that the
178	   Unicode BIDI property can have:

180	   o  L - Left-to-right - most letters in LTR scripts

182	   o  R - Right-to-left - most letters in non-Arabic RTL scripts

184	   o  AL - Arabic letters - most letters in the Arabic script

185	   o  EN - European Number (0-9, and Extended Arabic-Indic numbers)

187	   o  ES - European Number Separator (+ and -)

189	   o  ET - European Number Terminator (currency symbols, the hash sign,
190	      the percent sign and so on)

192	   o  AN - Arabic Number; this encompasses the Arabic-Indic numbers, but
193	      not the Extended Arabic-Indic numbers

195	   o  CS - Common Number Separator (. , / : et al)

197	   o  NSM - Non-spacing Mark - most combining accents

199	   o  BN - Boundary Neutral - control characters (ZWNJ, ZWJ and others)

201	   o  B - Paragraph Separator

203	   o  S - Segment Separator

205	   o  WS - Whitespace, including the SPACE character

207	   o  ON - Other Neutrals, including @, &, parentheses, MIDDLE DOT

209	   o  LRE, LRO, RLE, RLO, PDF - these are "directional control
210	      characters", and are not used in IDNA labels.
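
   As a non-normative illustration, the BIDI class of a character can
   be looked up programmatically.  The sketch below assumes Python's
   standard "unicodedata" module, whose answers reflect the Unicode
   version bundled with the interpreter; the helper name bidi_class and
   the choice of sample characters are ours and not part of this memo.

```python
import unicodedata

# Sample characters covering several of the classes listed above.
# (Illustrative selection only.)
SAMPLES = [
    "a",        # Latin letter
    "\u05D0",   # Hebrew letter ALEF
    "\u0627",   # Arabic letter ALEF
    "5",        # European digit
    "\u06F5",   # Extended Arabic-Indic digit five
    "\u0665",   # Arabic-Indic digit five
    "-",        # HYPHEN-MINUS
    "%",        # percent sign
    ".",        # FULL STOP
    "\u05B8",   # Hebrew point QAMATS
    "\u200C",   # ZERO WIDTH NON-JOINER
    " ",        # SPACE
    "@",        # commercial at
]


def bidi_class(ch: str) -> str:
    """Return the Unicode BIDI class abbreviation for one character."""
    return unicodedata.bidirectional(ch)


for ch in SAMPLES:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {bidi_class(ch)}")
```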

212	   In this memo, we use "network order" to describe the sequence of
213	   characters as transmitted on the wire or stored in a file; the terms
214	   "first", "next", "previous", "beginning", "end", "before" and "after"
215	   are used to refer to the relationship of characters and labels in
216	   network order.

218	   We use "display order" to talk about the sequence of characters as
219	   imaged on a display medium; the terms "left" and "right" are used to
220	   refer to the relationship of characters and labels in display order.

222	   Most of the time, the examples use the abbreviations for the Unicode
223	   BIDI classes to denote the directionality of the characters; the
224	   example string CS L consists of one character of class CS and one
225	   character of class L.  Some examples instead use the convention that
226	   uppercase characters are of class R or AL and lowercase characters
227	   are of class L; thus, the example string ABC.abc would consist of 3
228	   right-to-left characters and 3 left-to-right characters.

230	   The directionality of such examples is determined by context - for
231	   instance, in the sentence "ABC.abc is displayed as CBA.abc", the
232	   first example string is in network order, the second example string
233	   is in display order.

235	   The term "paragraph" is used in the sense of the Unicode BIDI
236	   specification [UAX9] - it means "a block of text that has an overall
237	   direction, either left-to-right or right-to-left", approximately; see
238	   UAX 9 for the details.

240	   "RTL" and "LTR" are abbreviations for "right to left" and "left to
241	   right", respectively.

243	   An RTL label is a label that contains at least one character of type
244	   R, AL or AN.

246	   An LTR label is any label that is not an RTL label.

248	   A "BIDI domain name" is a domain name that contains at least one RTL
249	   label.  (Note: This definition includes domain names containing only
250	   dots and right-to-left characters.  Providing a separate category of
251	   "RTL domain names" would not make this specification simpler, so has
252	   not been done.)
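
   The three definitions above can be sketched in a few lines.  This is
   an illustration only: the function names are hypothetical, labels
   are assumed to be U-labels separated by FULL STOP, and the BIDI
   classes come from Python's "unicodedata" module.

```python
import unicodedata

# BIDI classes whose presence makes a label an RTL label (Section 1.4).
RTL_CLASSES = {"R", "AL", "AN"}


def is_rtl_label(label: str) -> bool:
    """True if the label contains at least one R, AL or AN character."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in label)


def is_ltr_label(label: str) -> bool:
    """An LTR label is any label that is not an RTL label."""
    return not is_rtl_label(label)


def is_bidi_domain_name(name: str) -> bool:
    """True if at least one label of the domain name is an RTL label."""
    return any(is_rtl_label(label) for label in name.split(".") if label)


print(is_bidi_domain_name("example.com"))         # no RTL label
print(is_bidi_domain_name("\u05D0\u05D1.com"))    # first label is Hebrew
```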

254	2.  The BIDI Rule

256	   The following rule, consisting of six conditions, applies to labels
257	   in BIDI domain names.  The requirements that this rule satisfies are
258	   described in Section 3.  All the conditions must be satisfied for the
259	   rule to be satisfied.

261	   1.  The first character must be a character with BIDI property L, R
262	       or AL.  If it has the R or AL property, it is an RTL label; if it
263	       has the L property, it is an LTR label.

265	   2.  In an RTL label, only characters with the BIDI properties R, AL,
266	       AN, EN, ES, CS, ET, ON, BN and NSM are allowed.

268	   3.  In an RTL label, the end of the label must be a character with
269	       BIDI property R, AL, EN or AN, followed by zero or more
270	       characters with BIDI property NSM.

272	   4.  In an RTL label, if an EN is present, no AN may be present, and
273	       vice versa.

275	   5.  In an LTR label, only characters with the BIDI properties L, EN,
276	       ES, CS, ET, ON, BN and NSM are allowed.

278	   6.  In an LTR label, the end of the label must be a character with
279	       BIDI property L or EN, followed by zero or more characters with
280	       BIDI property NSM.
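
   The six conditions can be checked mechanically.  The following is a
   minimal sketch, not a reference implementation: the function name
   satisfies_bidi_rule is hypothetical, BIDI classes are taken from
   Python's "unicodedata" module (and so depend on its Unicode
   version), an empty label is treated as failing, and none of the
   other IDNA2008 validity criteria are checked.

```python
import unicodedata

RTL_ALLOWED = {"R", "AL", "AN", "EN", "ES", "CS", "ET", "ON", "BN", "NSM"}
LTR_ALLOWED = {"L", "EN", "ES", "CS", "ET", "ON", "BN", "NSM"}


def satisfies_bidi_rule(label: str) -> bool:
    """Check one label against conditions 1 to 6 of the BIDI rule."""
    if not label:
        return False  # assumption: an empty label is not acceptable
    classes = [unicodedata.bidirectional(ch) for ch in label]

    # Condition 1: the first character decides RTL versus LTR.
    if classes[0] in ("R", "AL"):
        rtl = True
    elif classes[0] == "L":
        rtl = False
    else:
        return False

    # Conditions 3 and 6 look at the last character before any
    # trailing run of NSM characters.
    end = len(classes)
    while end > 0 and classes[end - 1] == "NSM":
        end -= 1
    last = classes[end - 1]  # end >= 1 because classes[0] is not NSM

    if rtl:
        if not set(classes) <= RTL_ALLOWED:        # condition 2
            return False
        if last not in ("R", "AL", "EN", "AN"):    # condition 3
            return False
        if "EN" in classes and "AN" in classes:    # condition 4
            return False
        return True

    if not set(classes) <= LTR_ALLOWED:            # condition 5
        return False
    return last in ("L", "EN")                     # condition 6


print(satisfies_bidi_rule("abc1"))           # LTR label ending in EN
print(satisfies_bidi_rule("\u05D05"))        # ALEF followed by DIGIT FIVE
print(satisfies_bidi_rule("5\u05D0"))        # starts with EN
```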

282	   The following guarantees can be made based on the above:

284	   o  In a domain name consisting of only labels that satisfy the rule,
285	      the requirements of Section 3 are satisfied.  Note that even LTR
286	      labels and pure ASCII labels have to be tested.

288	   o  In a domain name consisting of only LDH-labels and labels that
289	      satisfy the rule, the requirements of Section 3 are satisfied as
290	      long as a label that starts with an ASCII digit does not come
291	      after a right-to-left label.

293	   No guarantee is given for other combinations.
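
   The caveat in the second guarantee can be detected at the level of
   a whole domain name.  The sketch below is one possible reading, in
   which "comes after" is taken to mean "is the next label in network
   order"; that interpretation and the function names are assumptions
   made for illustration.

```python
import unicodedata

ASCII_DIGITS = set("0123456789")


def is_rtl_label(label: str) -> bool:
    """True if the label contains an R, AL or AN character."""
    return any(unicodedata.bidirectional(ch) in {"R", "AL", "AN"}
               for ch in label)


def digit_labels_after_rtl(name: str) -> list:
    """Return (rtl_label, next_label) pairs, in network order, where the
    next label starts with an ASCII digit."""
    labels = name.split(".")
    return [
        (prev, nxt)
        for prev, nxt in zip(labels, labels[1:])
        if is_rtl_label(prev) and nxt[:1] in ASCII_DIGITS
    ]


# A Hebrew label followed by a label that starts with a digit is flagged.
print(digit_labels_after_rtl("\u05D0\u05D1.3com.example"))
# The reverse order is not flagged.
print(digit_labels_after_rtl("3com.\u05D0\u05D1"))
```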

295	3.  The requirement set for the BIDI rule

297	   This document, unlike RFC 3454, proposes an explicit justification
298	   for the BIDI rule, and states a set of requirements against which
299	   the rule can be tested.

302	   All the text in this document assumes that text containing the labels
303	   under consideration will be displayed using the Unicode bidirectional
304	   algorithm [UAX9].

306	   The requirements proposed are these:

308	   o  Label Uniqueness: No two labels, when presented in display order
309	      in the same paragraph, should have the same sequence of characters
310	      without also having the same sequence of characters in network
311	      order, both when the paragraph has LTR direction and when the
312	      paragraph has RTL direction.  (This is the criterion that is
313	      explicit in RFC 3454).  (Note that a label displayed in an RTL
314	      paragraph may display the same as a different label displayed in
315	      an LTR paragraph, and still satisfy this criterion.)

317	   o  Character Grouping: When displaying a string of labels, using the
318	      Unicode BIDI algorithm to reorder the characters for display, the
319	      characters of each label should remain grouped between the
320	      characters delimiting the labels, both when the string is embedded
321	      in a paragraph with LTR direction and when it is embedded in a
322	      paragraph with RTL direction.

324	   Several stronger statements were considered and rejected, because
325	   they seem to be impossible to fulfil within the constraints of the
326	   Unicode bidirectional algorithm.  These include:

328	   o  The appearance of a label should be unaffected by its embedding
329	      context.  This proved impossible even for ASCII labels; the label
330	      "123-A" will have a different display order in an RTL context than
331	      in an LTR context.  (This particular example is, however,
332	      disallowed anyway.)

334	   o  The sequence of labels should be consistent with network order.
335	      This proved impossible - a domain name consisting of the labels
336	      (in network order) L1.R2.R3.L4 will be displayed as L1.R3.R2.L4 in
337	      an LTR context.  (In an RTL context, it will be displayed as
338	      L4.R3.R2.L1).

340	   o  No two domain names should be displayed the same, even under
341	      differing directionality.  This was shown to be unsound, since the
342	      domain name (in network order) ABC.abc will have display order
343	      CBA.abc in an LTR context and abc.CBA in an RTL context, while the
344	      domain name (network) abc.ABC will have display order abc.CBA in
345	      an LTR context and CBA.abc in an RTL context.
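
   The display orders quoted in the last two bullets can be reproduced
   with a deliberately simplified model of the Unicode bidirectional
   algorithm.  The sketch below is a toy, not an implementation of
   [UAX9]: it follows the uppercase/lowercase convention of
   Section 1.4, treats every other character (including digits) as a
   neutral, and ignores numbers, combining marks and directional
   controls.  The function name display_order is hypothetical.

```python
def display_order(text: str, paragraph_rtl: bool) -> str:
    """Reorder `text` (network order) for display, toy model only."""
    def strong(ch):
        if ch.isupper():
            return "R"   # convention: uppercase stands for R or AL
        if ch.islower():
            return "L"   # convention: lowercase stands for L
        return None      # everything else is treated as neutral

    base = "R" if paragraph_rtl else "L"
    kinds = [strong(ch) for ch in text]
    n = len(text)

    # Neutrals take the direction of their neighbours when both sides
    # agree, otherwise the paragraph direction.
    resolved = list(kinds)
    i = 0
    while i < n:
        if kinds[i] is None:
            j = i
            while j < n and kinds[j] is None:
                j += 1
            before = kinds[i - 1] if i > 0 else base
            after = kinds[j] if j < n else base
            fill = before if before == after else base
            for k in range(i, j):
                resolved[k] = fill
            i = j
        else:
            i += 1

    # R characters sit at level 1; L characters sit at level 0 in an
    # LTR paragraph and at level 2 in an RTL paragraph.
    levels = [1 if d == "R" else (2 if paragraph_rtl else 0)
              for d in resolved]

    # Reverse runs from the highest level down to level 1.
    chars = list(text)
    for lvl in range(max(levels, default=0), 0, -1):
        i = 0
        while i < n:
            if levels[i] >= lvl:
                j = i
                while j < n and levels[j] >= lvl:
                    j += 1
                chars[i:j] = reversed(chars[i:j])
                i = j
            else:
                i += 1
    return "".join(chars)


for name in ("ABC.abc", "abc.ABC", "a.B.C.d"):
    print(name, "LTR:", display_order(name, False),
          "RTL:", display_order(name, True))
```

   In this model the single-letter labels of "a.B.C.d" stand in for
   L1.R2.R3.L4.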

347	   One possible requirement was thought to be problematic, but turned
348	   out to be satisfied by a string that obeys the proposed rules:

350	   o  The Character Grouping requirement should be satisfied when
351	      directional controls (LRE, RLE, RLO, LRO, PDF) are used in the
352	      same paragraph (outside of the labels).  Because these controls
353	      affect presentation order in non-obvious ways, by affecting the
354	      "sor" and "eor" properties of the Unicode BIDI algorithm, the
355	      conditions above require extra testing in order to figure out
356	      whether or not they influence the display of the domain name.
357	      Testing found that for the strings allowed under the rule
358	      presented in this document, directional controls do not influence
359	      the display of the domain name.

361	   This is still not stated as a requirement, since it did not seem as
362	   important as those stated, but it is useful to know that BIDI domain
363	   names where the labels satisfy the rule have this property.

365	   In the following descriptions, first-level bullets are used to
366	   indicate rules or normative statements; second-level bullets are
367	   commentary.

369	   The Character Grouping requirement can be more formally stated as:

371	   o  Let "Delimiterchars" be a set of characters with the Unicode BIDI
372	      properties CS, WS, ON.  (These are commonly used to delimit labels
373	      - both the FULL STOP and the space are included.  They are not
374	      allowed in domain labels.)
375	      *  ET, though it commonly occurs next to domain names in practice,
376	         is problematic: the context R CS L EN ET (for instance A.a1%)
377	         makes the label L EN not satisfy the character grouping
378	         requirement.

380	      *  ES commonly occurs in labels as HYPHEN-MINUS, but could also be
381	         used as a delimiter (for instance, the plus sign).  It is left
382	         out here.

384	   o  Let "unproblematic label" be a label that either satisfies the
385	      requirements, or does not contain any character with the BIDI
386	      properties R, AL or AN, and does not begin with a character with
387	      the BIDI property EN.  (Informally, "it does not start with a
388	      number".)

390	   A label X satisfies the Character Grouping requirement when, for any
391	   characters D1 and D2 from Delimiterchars, and for any S1 and S2 that
392	   are each either an unproblematic label or an empty string, the
392	   following holds true:

394	   If the string formed by concatenating S1, D1, X, D2 and S2 is
395	   reordered according to the BIDI algorithm, then all the characters of
396	   X in the reordered string are between D1 and D2, and no other
397	   characters are between D1 and D2, both if the overall paragraph
398	   direction is LTR and if the overall paragraph direction is RTL.

400	   Note that the definition is self-referential, since S1 and S2 are
401	   constrained to be "legal" by this definition.  This makes testing
402	   changes to proposed rules a little complex, but does not create
403	   problems for testing whether or not a given proposed rule satisfies
404	   the criterion.

406	   The "zero-length" case represents the case where a domain name is
407	   next to something that isn't a domain name, separated by a delimiter
408	   character.

410	   Note about the position of BN: The Unicode bidirectional algorithm
411	   specifies that a BN has an effect on the adjoining characters in
412	   network order, not in display order, and is therefore treated as if
413	   removed during BIDI processing ([UAX9] section 3.3.2 rule X9 and
414	   section 5.3).  Therefore, the question of "what position does a BN
415	   have after reordering" is not meaningful.  It has been ignored while
416	   developing the rules here.

418	   The Label Uniqueness requirement can be formally stated as:

420	   If two non-identical labels X and Y, embedded as for the test above,
421	   displayed in paragraphs with the same directionality, are reordered
422	   by the BIDI algorithm into the same sequence of codepoints, the
423	   labels X and Y cannot both be legal.

425	4.  Examples of issues found with RFC 3454

427	4.1.  Dhivehi

429	   Dhivehi, the official language of the Maldives, is written with the
430	   Thaana script.  This script displays some of the characteristics of
431	   Arabic script, including its directional properties, and the
432	   indication of vowels by the diacritical marking of consonantal base
433	   characters.  This marking is obligatory, and both two consecutive
434	   vowels and syllable-final consonants are indicated with unvoiced
435	   combining marks.  Every Dhivehi word therefore ends with a combining
436	   mark.

438	   The word for "computer", which is romanized as "konpeetaru", is
439	   written with the following sequence of Unicode code points:

441	      U+0786 THAANA LETTER KAAFU (AL)

443	      U+07AE THAANA OBOFILI (NSM)

445	      U+0782 THAANA LETTER NOONU (AL)

447	      U+07B0 THAANA SUKUN (NSM)

449	      U+0795 THAANA LETTER PAVIYANI (AL)

451	      U+07A9 THAANA EEBEEFILI (NSM)

453	      U+0793 THAANA LETTER TAVIYANI (AL)

455	      U+07A6 THAANA ABAFILI (NSM)

457	      U+0783 THAANA LETTER RAA (AL)

459	      U+07AA THAANA UBUFILI (NSM)

461	   The directionality class of U+07AA in the Unicode database [Unicode]
462	   is NSM (non-spacing mark), which is not R or AL; a conformant
463	   implementation of the IDNA2003 algorithm will say that "this is not
464	   in RandALCat", and refuse to encode the string.
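
   The rejection can be reproduced with a short check of the quoted
   requirement.  This sketch covers only requirement 3 of RFC 3454
   section 6 as cited in Section 1.2, not the rest of Stringprep; the
   function name is hypothetical and the BIDI classes come from
   Python's "unicodedata" module.

```python
import unicodedata

RANDALCAT = {"R", "AL"}

# "konpeetaru" as the code point sequence listed above.
KONPEETARU = "\u0786\u07AE\u0782\u07B0\u0795\u07A9\u0793\u07A6\u0783\u07AA"


def passes_rfc3454_requirement_3(s: str) -> bool:
    """If any RandALCat character is present, both the first and the
    last character of the string must be RandALCat."""
    classes = [unicodedata.bidirectional(ch) for ch in s]
    if not any(c in RANDALCAT for c in classes):
        return True
    return classes[0] in RANDALCAT and classes[-1] in RANDALCAT


print(unicodedata.bidirectional(KONPEETARU[-1]))   # class of U+07AA
print(passes_rfc3454_requirement_3(KONPEETARU))
```

   Dropping the final combining mark makes the check pass, but the
   result is no longer the Dhivehi word.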

466	4.2.  Yiddish

468	   Yiddish is one of several languages written with the Hebrew script
469	   (others include Hebrew and Ladino).  This is basically a consonantal
470	   alphabet (also termed an "abjad") but Yiddish is written using an
471	   extended form that is fully vocalic.  The vowels are indicated in
472	   several ways, of which one is by repurposing letters that are
473	   consonants in Hebrew.  Other letters are used both as vowels and
474	   consonants, with combining marks, called "points", used to
475	   differentiate between them.  Finally, some base characters can
476	   indicate several different vowels, which are also disambiguated by
477	   combining marks.  Pointed characters can appear in word-final
478	   position and may therefore also be needed at the end of labels.  This
479	   is not an invariable attribute of a Yiddish string and there is thus
480	   greater latitude here than there is with Dhivehi.

482	   The organization now known as the "YIVO Institute for Jewish
483	   Research" developed orthographic rules for modern Standard Yiddish
484	   during the 1930s on the basis of work conducted in several venues
485	   since earlier in that century.  These are given in, "The Standardized
486	   Yiddish Orthography: Rules of Yiddish Spelling" [SYO], and are taken
487	   as normatively descriptive of modern Standard Yiddish in any context
488	   where that notion is deemed relevant.  They have been applied
489	   exclusively in all formal Yiddish dictionaries published since their
490	   establishment, and are similarly dominant in academic and
491	   bibliographic regards.

493	   It therefore appears appropriate for this repertoire also to be
494	   supported fully by IDNA.  This presents no difficulty with characters
495	   in initial and medial positions, but pointed characters are regularly
496	   used in final position as well.  All of the characters in the SYO
497	   repertoire appear in both marked and unmarked form with one
498	   exception: the HEBREW LETTER PE (U+05E4).  The SYO only permits this
499	   with a HEBREW POINT DAGESH (U+05BC), providing the Yiddish equivalent
500	   to the Latin letter "p", or a HEBREW POINT RAFE (U+05BF), equivalent
501	   to the Latin letter "f".  There is, however, a separate unpointed
502	   allograph, the HEBREW LETTER FINAL PE (U+05E3), for the latter
503	   character when it appears in final position.  The constraint on the
504	   use of the SYO repertoire resulting from the proscription of
505	   combining marks at the end of RTL strings thus reduces to nothing
506	   more, or less, than the equivalent of saying that a string of Latin
507	   characters cannot end with the letter "p".  It must also be noted
508	   that the HEBREW LETTER PE with HEBREW POINT DAGESH is characteristic
509	   of almost all traditional Yiddish orthographies that predate (or
510	   remain in use in parallel to) the SYO, being the first pointed
511	   character to appear in any of them.

513	   A more general instantiation of the basic problem can be seen in the
514	   representation of the YIVO acronym.  This acronym is written with the
515	   Hebrew letters YOD YOD HIRIQ VAV VAV ALEF QAMATS, where HIRIQ and
516	   QAMATS are combining points.  The distinct Unicode codepoints used are:

518	      U+05D9 HEBREW LETTER YOD (R)

520	      U+05B4 HEBREW POINT HIRIQ (NSM)

522	      U+05D5 HEBREW LETTER VAV (R)

524	      U+05D0 HEBREW LETTER ALEF (R)

526	      U+05B8 HEBREW POINT QAMATS (NSM)

528	   The directionality class of U+05B8 HEBREW POINT QAMATS in the Unicode
529	   database is NSM, which again causes the IDNA2003 algorithm to reject
530	   the string.

532	   It may also be noted that all of the combined characters mentioned
533	   above exist in precomposed form at separate positions in the Unicode
534	   chart.  However, by invoking Stringprep, the IDNA2003 algorithm also
535	   rejects those codepoints, for reasons not discussed here.

537	4.3.  Strings with numbers

539	   By requiring that both the first and the last character of a string
540	   be of category R or AL, RFC 3454 prohibited a string containing
541	   right-to-left characters from ending with a number.

543	   Consider the strings ALEF 5 (HEBREW LETTER ALEF + DIGIT FIVE) and 5
544	   ALEF.  Displayed in an LTR context, the first one will be displayed
545	   from left to right as 5 ALEF (with the 5 being considered right-to-
546	   left because of the leading ALEF), while 5 ALEF will be displayed in
547	   exactly the same order (5 taking the direction from context).
548	   Clearly, only one of those should be permitted as a registered label,
549	   but barring them both seems unnecessary.

551	5.  Troublesome situations and guidelines

553	   There are situations in which labels that satisfy the rule above will
554	   be displayed in a surprising fashion.  The most important of these is
555	   the case where a label ending in a character with BIDI property AL,
556	   AN or R occurs before a label beginning with a character of BIDI
557	   property EN.  In that case, the number will appear to move into the
558	   label containing the right-to-left character, violating the Character
559	   Grouping requirement.

561	   If the label that occurs after the right-to-left label itself
562	   satisfies the BIDI criterion, the requirements will be satisfied in
563	   all cases (this is the reason why the criterion talks about strings
564	   containing L in some cases).  However, the WG concluded that this
565	   could not be required for several reasons:

567	   o  There is a large current deployment of ASCII domain names starting
568	      with digits.  These cannot possibly be invalidated.

570	   o  Domain names are often constructed piecemeal, for instance by
571	      combining a string with the content of a search list.  This may
572	      occur after IDNA processing, and thus in part of the code that is
573	      not IDNA-aware, making detection of the undesirable combination
574	      impossible.

576	   o  Even if a label is registered under a "safe" label, there may be a
577	      DNAME [RFC2672] with an "unsafe" label that points to the "safe"
578	      label, thus creating seemingly-valid names that would not satisfy
579	      the criterion.

581	   o  Wildcards create the odd situation where a label is "valid" (can
582	      be looked up successfully) without the zone owner knowing that
583	      this label exists.  So an owner of a zone whose name starts with a
584	      digit and contains a wildcard has no way of controlling whether or
585	      not names with RTL labels in them are looked up in that zone.

587	   Rather than trying to suggest rules that disallow all such
588	   undesirable situations, this document merely warns about the
589	   possibility, and leaves it to application developers to take whatever
590	   measures they deem appropriate to avoid problematic situations.

592	6.  Other issues in need of resolution

594	   This document concerns itself only with the rules that are needed
595	   when dealing with domain names with characters that have differing
596	   BIDI properties, and considers characters only in terms of their BIDI
597	   properties.  All other issues with scripts that are written from
598	   right to left must be considered in other contexts.

600	   One such issue is the need to keep numbers separate.  Several scripts
601	   are used with multiple sets of numbers - most commonly they use Latin
602	   numbers and a script-specific set of numbers, but in the case of
603	   Arabic, there are two sets of "Arabic-Indic" digits involved.

605	   The algorithm in this document disallows occurrences of AN-class
606	   characters ("Arabic-Indic digits", U+0660 to U+0669) together with
607	   EN-class characters (which include "European" digits, U+0030 to
608	   U+0039 and "extended Arabic-Indic digits", U+06F0 to U+06F9), but
609	   does not help in preventing the mixing of, for instance, Bengali
610	   digits (U+09E6 to U+09EF) and Gujarati digits (U+0AE6 to U+0AEF),
611	   both of which have BIDI class L. A registry or script community that
612	   wishes to create rules restricting the mixing of digits in a label
613	   will be able to specify these restrictions at the registry level.
614	   Some rules are also specified at the protocol level.
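
   The BIDI classes mentioned above can be inspected, and a possible
   registry-level restriction sketched, as follows.  This is only an
   illustration under stated assumptions: the function names digit_sets
   and mixes_digit_sets are hypothetical, and the sketch assumes that
   each set of decimal digits (general category Nd) occupies ten
   consecutive code points, so that subtracting a digit's value from
   its code point identifies the set it belongs to.

```python
import unicodedata


def digit_sets(label):
    """Return the code points of the 'zero' digit of every decimal digit
    set (general category Nd) that occurs in the label."""
    zeros = set()
    for ch in label:
        if unicodedata.category(ch) == "Nd":
            zeros.add(ord(ch) - unicodedata.digit(ch))
    return zeros


def mixes_digit_sets(label):
    """True if the label contains digits from more than one digit set."""
    return len(digit_sets(label)) > 1


# BIDI classes of a digit from each set discussed in the text.
for name, ch in [("Arabic-Indic", "\u0661"),
                 ("extended Arabic-Indic", "\u06f1"),
                 ("European", "1"),
                 ("Bengali", "\u09e7"),
                 ("Gujarati", "\u0ae7")]:
    print(name, unicodedata.bidirectional(ch))

# Bengali ONE next to Gujarati TWO: both are class L, so the BIDI rule
# does not object, but a registry-level check of this kind would.
print(mixes_digit_sets("\u09e7\u0ae8"))
```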

616	   Another set of issues concerns the proper display of IDNs with a
617	   mixture of LTR and RTL labels, or only RTL labels.

619	   It is unrealistic to expect that applications will display domain
620	   names using embedded formatting codes between their labels (for one
621	   thing, no reliable algorithms for identifying domain names in running
622	   text exist); thus, the display order will be determined by the BIDI
623	   algorithm.  For example, the sequence R1.R2.ltr (in network order) is
624	   displayed in the order 2R.1R.ltr in an LTR context, which might
625	   surprise someone expecting to see labels displayed in hierarchical
626	   order.  People used to working with text that mixes LTR and RTL
627	   strings might not be so surprised by this.  Again, this memo does not
628	   attempt to suggest a solution to this problem.

630	7.  Compatibility considerations

632	7.1.  Backwards compatibility considerations

634	   As with any change to an existing standard, it is important to
635	   consider what happens with existing implementations when the change
636	   is introduced.  Some troublesome cases include:

638	   o  Old program used to input the newly-allowed label.  If the old
639	      program checks the input against RFC 3454, some labels will not be
640	      allowed, and domain names containing those labels will remain
641	      inaccessible.

643	   o  Old program is asked to display the newly-allowed label, and
644	      checks it against RFC 3454 before displaying.  The program will
645	      perform some kind of fallback, most likely displaying the label in
646	      A-label form.

648	   o  Old program tries to display the newly-allowed label.  If the old
649	      program has code for displaying the last character of a label that
650	      is different from the code used to display the characters in the
651	      middle of the label, the display may be inconsistent and cause
652	      confusion.

654	   One particular example of the last case is if a program chooses to
655	   examine the last character (in network order) of a string in order to
656	   determine its directionality, rather than its first.  If it finds an
657	   NSM character and tries to display the string as if it were a
658	   left-to-right string, the resulting display may be interesting, but
659	   not useful.
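
   The difference between examining the first and the last character
   can be seen in the following Python sketch.  The helper name
   strong_direction and the two-character Thaana sample label are
   hypothetical and chosen only for illustration; the sketch is not
   taken from any existing implementation.

```python
import unicodedata


def strong_direction(ch):
    """Classify one character by its BIDI class alone."""
    bidi_class = unicodedata.bidirectional(ch)
    if bidi_class in ("R", "AL"):
        return "RTL"
    if bidi_class == "L":
        return "LTR"
    return "unknown"  # e.g. NSM, EN, AN: no strong direction of its own


# Thaana letter DHAALU (class AL) followed by the vowel sign IBIFILI
# (class NSM), in network order.
label = "\u078b\u07a8"

print(strong_direction(label[0]))    # decision based on the first character
print(strong_direction(label[-1]))   # decision based on the last character
print(unicodedata.bidirectional(label[-1]))
```

   A program that falls back to left-to-right handling whenever the
   examined character has no strong direction would treat this label
   differently depending on which end of the label it examines.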

661	   The editors believe that these cases will have less harmful impact in
662	   practice than continuing to deny the use of words from the languages
663	   for which these strings are necessary as IDN labels.

665	   This specification does not forbid using leading European digits in
666	   ASCII-only labels, since this would conflict with a large installed
667	   base of such labels, and would increase the scope of the
668	   specification from RTL labels to all labels.  The harm resulting from
669	   this limitation of scope is described in Section 5.  Registries and
670	   private zone managers can check for this particular condition before
671	   they allow registration of any RTL label.  Generally it is best to
672	   disallow registration of any right-to-left strings in a zone where
673	   the label at the level above begins with a digit.

675	7.2.  Forward compatibility considerations

677	   This text is intentionally specified strictly in terms of the Unicode
678	   BIDI properties.  The determination that the condition is sufficient
679	   to fulfil the criteria depends on the Unicode BIDI algorithm; it is
680	   unlikely that drastic changes will be made to this algorithm.

682	   However, the determination of validity for any string depends on the
683	   Unicode BIDI property values, which are not declared immutable by the
684	   Unicode Consortium.  Furthermore, the behaviour of the algorithm for
685	   any given character is likely to be linguistically and culturally
686	   sensitive, so that while it should occur rarely, it is possible that
687	   later versions of the Unicode standard may change the BIDI properties
688	   assigned to certain Unicode characters.
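
   Because the result of the rule depends on the property tables in
   use, an implementation may find it useful to record which version of
   the Unicode data it consulted.  The following Python sketch shows
   one way to do so; it is not a solution to the problem described
   above.  The function name bidi_snapshot is hypothetical, and the
   version reported is that of the Unicode database bundled with the
   Python interpreter, which need not be Unicode 5.2.

```python
import unicodedata


def bidi_snapshot(label):
    """Return the Unicode data version in use together with the BIDI
    class of each character of the label, in network order."""
    classes = [unicodedata.bidirectional(ch) for ch in label]
    return unicodedata.unidata_version, classes


version, classes = bidi_snapshot("a1\u05d0")
print("Unicode data version:", version)
print("BIDI classes:", classes)
```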

690	   This memo does not propose a solution for this problem.

692	8.  IANA Considerations

694	   This document makes no request of IANA.

696	   Note to RFC Editor: this section may be removed on publication as an
697	   RFC.

699	9.  Security Considerations

701	   The display behaviour of mixed-direction text can be extremely
702	   surprising to users who are not used to it; for instance, cut and
703	   paste of a piece of text can cause the text to display differently at
704	   the destination, if the destination is in another directionality
705	   context, and adding a character in one place of a text can cause
706	   characters some distance from the point of insertion to change their
707	   display position.  This is, however, not a phenomenon unique to the
708	   display of domain names.

710	   The new IDNA protocol, and particularly these new BIDI rules, will
711	   allow some strings to be used in IDNA contexts that are not allowed
712	   today.  It is possible that differences in the interpretation of
713	   labels between implementations of IDNA2003 and IDNA2008 could pose a
714	   security risk, but it is difficult to envision any specific
715	   instantiation of this.

717	   Any rational attempt to compute, for instance, a hash over an
718	   identifier processed by IDNA would use network order for its
719	   computation, and thus be unaffected by the new rules proposed here.

721	   While it is not believed to pose a problem, display routines that
722	   were written with specific knowledge of the RFC 3454 IDNA
723	   prohibitions might encounter the problems noted in Section 7.1,
724	   which could cause new kinds of confusion.

726	10.  Acknowledgements

728	   While the listed editors held the pen, this document represents the
729	   joint work and conclusions of an ad hoc design team.  In addition to
730	   the editors this consisted of, in alphabetic order, Tina Dam, Patrik
731	   Faltstrom, and John Klensin.  Many further specific contributions and
732	   helpful comments were received from the people listed below, and
733	   others who have contributed to the development and use of the IDNA
734	   protocols.

736	   The particular formulation of the BIDI rule in section 2 was
737	   suggested by Matitiahu Allouche.

739	   The team wishes in particular to thank Roozbeh Pournader for calling
740	   its attention to the issue with the Thaana script, Paul Hoffman for
741	   pointing out the need to be explicit about backwards compatibility
742	   considerations, Ken Whistler for suggesting the basis of the
743	   formalized "character grouping" requirement, Mark Davis for
744	   commentary, Erik van der Poel for careful review, comments and
745	   verification of the rulesets, Marcos Sanz, Andrew Sullivan and Pete
746	   Resnick for reviews, and Vint Cerf for chairing the working group and
747	   contributing massively to getting the documents finished.

749	11.  References

750	11.1.  Normative references

752	   [I-D.ietf-idnabis-defs]
753	              Klensin, J., "Internationalized Domain Names for
754	              Applications (IDNA): Definitions and Document Framework",
755	              draft-ietf-idnabis-defs-12 (work in progress),
756	              October 2009.

758	   [UAX9]     Davis, M., "Unicode Standard Annex #9: The Bidirectional
759	              Algorithm, revision 19", March 2008.

761	   [Unicode]  Unicode, "The Unicode Standard - version 5.2", 2009.

763	11.2.  Informative references

765	   [I-D.ietf-idnabis-protocol]
766	              Klensin, J., "Internationalized Domain Names in
767	              Applications (IDNA): Protocol",
768	              draft-ietf-idnabis-protocol-17 (work in progress),
769	              October 2009.

771	   [RFC2672]  Crawford, M., "Non-Terminal DNS Name Redirection",
772	              RFC 2672, August 1999.

774	   [RFC3454]  Hoffman, P. and M. Blanchet, "Preparation of
775	              Internationalized Strings ("stringprep")", RFC 3454,
776	              December 2002.

778	   [SYO]      "The Standardized Yiddish Orthography: Rules of Yiddish
779	              Spelling", 6th ed., New York, ISBN 0-914512-25-0,
780	              1999.

782	Appendix A.  Change log

784	   This appendix is intended to be removed by the RFC Editor when this
785	   document is published as an RFC.

787	A.1.  Changes from draft-alvestrand-00 to -01

789	   Suggested a possible new algorithm.

791	   Multiple smaller changes.

793	A.2.  Changes from alvestrand-01 to -02

795	   Date of publication updated.

797	   Change log added.

799	A.3.  Changes from alvestrand-02 to -03

801	   Intro changed to reflect addressing the deeper issues with the BIDI
802	   algorithm.

804	   Gave formalized criteria for "valid strings", and documented the new
805	   set of requirements for strings that satisfy the criteria.

807	   Removed most of section 5, "Other problems", and noted that this memo
808	   focuses ONLY on issues that can be evaluated by looking at the BIDI
809	   properties of characters.

811	A.4.  Changes from alvestrand-03 to -04

813	   Added back AN to the list of allowed characters; it had been left out
814	   by accident in -03.

816	   Removed some rules that were redundant.

818	   Added some considerations for backwards compatibility and interaction
819	   with ASCII labels that start with a number.

821	   Mentioned the issue with DNAME pointing to a zone containing RTL
822	   labels in the security considerations section.

824	   Wording updates in multiple places, including some spelling errors.

826	   Rewrote the introduction section.

828	   Split references into "normative" and "informative".

830	A.5.  Changes from draft-alvestrand-04 to draft-ietf -00

832	   Changed name of draft.

834	   Added a couple of "note in draft" statements to remind the WG of open
835	   issues.

837	   Noted that BIDI controls in the paragraph are unproblematic with the
838	   given ruleset.

840	A.6.  Changes from idnabis -00 to -01

842	   Added text to section 5 describing issues with mixture of numbers in
843	   labels.

844	   Addressed some of the issues raised by Mark Davis in March 2008 in
845	   regard to document clarity.

847	   Changed the formulation of the label uniqueness requirement to be
848	   consistent with the text under "Labels with numbers".

850	   Spell-checked document.

852	A.7.  Changes from idnabis -01 to -02

854	   Changed the domain of applicability to be only labels containing RTL
855	   characters, described the conditions under which harm may result from
856	   putting RTL labels next to other labels, and how to detect them.

858	   A number of clarification and formatting changes in response to
859	   reviews.

861	A.8.  Changes from idnabis -02 to -03

863	   Rearranged section list so that the normative material is collected
864	   at the front.

866	   Moved list of BIDI properties into "terminology".

868	   Clarified that only terminology and the BIDI rule are normative.

870	   Changed reference to point to -defs for definitions instead of
871	   -rationale.

873	   Minor fixes in response to comments, wording cleanups, removed all
874	   tentative language.

876	A.9.  Changes from idnabis -03 to -04

878	   Updated to new IPR rules.

880	   Minor textual clarifications.

882	   Replaced the BIDI test with a version suggested by Matitiahu Allouche
883	   - this description is simpler to understand than the one in -03, and
884	   generates a larger set of allowable strings, while all tests indicate
885	   that they still pass all the criteria.

887	A.10.  Changes from idnabis -04 to -05

889	   Minor textual clarifications resulting from WG Last Call.  No
890	   technical changes.

892	   Updated UAX9 reference to Unicode 5.1 version.

894	   Made better use of some terminology, and clarified the relationship
895	   with RFC 3454 based on input from Paul Hoffman.

897	   Added examples of newly-forbidden labels, based on advice from Andrew
898	   Sullivan.

900	A.11.  Changes from idnabis -05 to -06

902	   Most of these changes are based on a review by Martin Duerst.

904	   Rewrote abstract.

906	   Changed "test" to "rule" throughout, with accompanying minor tweaks.

908	   Re-allowed BN in LTR labels (error introduced in -04).

910	   Added words to explain role of BN more (in the requirements section).

912	   Modified the words about the effect of BIDI changes after having
913	   reassurance that changes are likely to be rare.

915	   Minor textual fixes.

917	A.12.  Changes from idnabis -06 to -07

919	   Added a note in the intro saying explicitly that other parts of
920	   IDNABIS specify which characters are legal (in response to a Last
921	   Call comment from Joel Halpern).

923	   Inserted an explicit pointer to Dhivehi and a couple of other
924	   clarifying changes to the (non-normative) section 4.

926	   Mentioned Vint Cerf in the acknowledgements.

928	Authors' Addresses

930	   Harald Tveit Alvestrand (editor)
931	   Google
932	   Beddingen 10
933	   Trondheim,   7014
934	   Norway

936	   Email: harald@alvestrand.no

937	   Cary Karp
938	   Swedish Museum of Natural History
939	   Frescativ. 40
940	   Stockholm,   10405
941	   Sweden

943	   Phone: +46 8 5195 4055
945	   Email: ck@nrm.museum