idnits 2.17.1 

draft-alvestrand-idna-bidi-04.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** It looks like you're using RFC 3978 boilerplate.  You should update this
     to the boilerplate described in the IETF Trust License Policy document
     (see https://trustee.ietf.org/license-info), which is required now.

  -- Found old boilerplate from RFC 3978, Section 5.1 on line 16.

  -- Found old boilerplate from RFC 3978, Section 5.5, updated by RFC 4748 on
     line 702.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 713.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 720.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 726.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  == There are 1 instance of lines with non-RFC2606-compliant FQDNs in the
     document.

  ** The document seems to lack a both a reference to RFC 2119 and the
     recommended RFC 2119 boilerplate, even if it appears to use RFC 2119
     keywords. 

     RFC 2119 keyword, line 99: '...      character MUST be the first char...'
     RFC 2119 keyword, line 100: '...dALCat character MUST be the last char...'


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust Copyright Line does not match the
     current year

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (Feb 14, 2008) is 5916 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Possible downref: Normative reference to a draft: ref.
     'I-D.klensin-idnabis-issues' 

  -- Possible downref: Non-RFC (?) normative reference: ref. 'UAX9'

  -- Obsolete informational reference (is this intentional?): RFC 3454
     (Obsoleted by RFC 7564)


     Summary: 2 errors (**), 0 flaws (~~), 2 warnings (==), 10 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Network Working Group                                 H. Alvestrand, Ed.
3	Internet-Draft                                                    Google
4	Intended status: Standards Track                            C. Karp, Ed.
5	Expires: August 17, 2008               Swedish Museum of Natural History
6	                                                            Feb 14, 2008

8	          An updated IDNA criterion for right-to-left scripts
9	                     draft-alvestrand-idna-bidi-04

11	Status of this Memo

13	   By submitting this Internet-Draft, each author represents that any
14	   applicable patent or other IPR claims of which he or she is aware
15	   have been or will be disclosed, and any of which he or she becomes
16	   aware will be disclosed, in accordance with Section 6 of BCP 79.

18	   Internet-Drafts are working documents of the Internet Engineering
19	   Task Force (IETF), its areas, and its working groups.  Note that
20	   other groups may also distribute working documents as Internet-
21	   Drafts.

23	   Internet-Drafts are draft documents valid for a maximum of six months
24	   and may be updated, replaced, or obsoleted by other documents at any
25	   time.  It is inappropriate to use Internet-Drafts as reference
26	   material or to cite them other than as "work in progress."

28	   The list of current Internet-Drafts can be accessed at
29	   http://www.ietf.org/ietf/1id-abstracts.txt.

31	   The list of Internet-Draft Shadow Directories can be accessed at
32	   http://www.ietf.org/shadow.html.

34	   This Internet-Draft will expire on August 17, 2008.

36	Copyright Notice

38	   Copyright (C) The IETF Trust (2008).

40	Abstract

42	   The use of right-to-left scripts in internationalized domain names
43	   has presented several challenges.  This memo discusses some problems
44	   with these scripts, and some shortcomings in the 2003 IDNA BIDI
45	   criterion.  Based on this discussion, it proposes a new BIDI
46	   criterion for IDNA labels.

48	Table of Contents

50	   1.  Introduction and problem description
51	     1.1.  Purpose and applicability
52	     1.2.  Background and history
53	     1.3.  Terminology
54	   2.  Detailed examples
55	     2.1.  Dhivehi
56	     2.2.  Yiddish
57	     2.3.  Strings with numbers
58	   3.  An expanded justification for the bidi rule
59	   4.  A replacement for the RFC 3454 criterion
60	   5.  Other issues in need of resolution
61	   6.  Compatibility considerations
62	     6.1.  Backwards compatibility considerations
63	     6.2.  Forward compatibility considerations
64	   7.  IANA Considerations
65	   8.  Security Considerations
66	   9.  Acknowledgements
67	   Appendix A.  Change log
68	     A.1.  Changes from -00 to -01
69	     A.2.  Changes from -01 to -02
70	     A.3.  Changes from -02 to -03
71	     A.4.  Changes from -03 to -04
72	   10. References
73	     10.1. Normative references
74	     10.2. Informative references
75	   Authors' Addresses
76	   Intellectual Property and Copyright Statements

78	1.  Introduction and problem description

80	1.1.  Purpose and applicability

82	   This document's purpose is to establish a test that can be applied to
83	   Internationalized Domain Name (IDN) labels in Unicode form (U-labels)
84	   containing right-to-left characters.

86	   When labels pass the test, they can be used with a minimal chance of
87	   these labels being displayed in a confusing way by a bidirectional
88	   display algorithm.  In order to achieve this stability, it is also
89	   necessary that the test be applied to labels occurring before or after
90	   the label containing right-to-left characters, which prohibits some
91	   LDH-labels that are permitted in other contexts.

93	1.2.  Background and history

95	   The IDNA specification "Stringprep" [RFC3454] makes the following
96	   statement in its section 6 on the bidi algorithm:

98	      3) If a string contains any RandALCat character, a RandALCat
99	      character MUST be the first character of the string, and a
100	      RandALCat character MUST be the last character of the string.

102	   (A RandALCat character is a character with unambiguously right-to-
103	   left directionality.)

105	   The reasoning behind this prohibition was to ensure that every
106	   component of a displayed domain name has an unambiguously preferred
107	   direction.  However, this makes certain words in languages written
108	   with right-to-left scripts invalid as IDN labels, and in at least one
109	   case means that all the words of an entire language are forbidden as
110	   IDN labels.

112	   This will be illustrated below with examples taken from the Dhivehi
113	   and Yiddish languages, as written with the Thaana and Hebrew scripts,
114	   respectively.

116	   In investigating this problem, it was realized that RFC 3454 did not
117	   state precisely what requirement its rule was meant to fulfil, and
118	   it was therefore impossible to tell whether a simple relaxation of
119	   the rule would continue to fulfil that requirement.  A further
120	   investigation led to the conclusion that, for one reasonable set of
121	   requirements, IDNA2003's BIDI restriction did not fulfil them.  This
122	   document therefore proposes replacing the RFC 3454 BIDI requirement
123	   in its entirety.

125	   While the document proposes completely new text, most reasonable
126	   labels that were allowed under the old criterion will also be allowed
127	   under the new criterion, so the operational impact of the rule change
128	   is limited.

130	1.3.  Terminology

132	   In this memo, we use "network order" to describe the sequence of
133	   characters as transmitted on the wire or stored in a file; the terms
134	   "first", "next" and "previous" are used to refer to the relationship
135	   of characters in network order.

137	   We use "display order" to talk about the sequence of characters as
138	   imaged on a display medium; the terms "left" and "right" are used to
139	   refer to the relationship of characters in display order.

141	   Most of the time, the examples use the abbreviations for the Unicode
142	   Bidi classes to denote the directionality of the characters; in some
143	   examples, the convention that uppercase characters are of class R or
144	   AL, and lowercase characters are of class L is used - thus, the
145	   example string ABC.abc would consist of 3 right-to-left characters
146	   and 3 left-to-right characters.

148	   The other terminology used to describe IDNA concepts is defined in
149	   [I-D.klensin-idnabis-issues].

151	2.  Detailed examples

153	2.1.  Dhivehi

155	   Dhivehi, the official language of the Maldives, is written with the
156	   Thaana script.  This displays some of the characteristics of Arabic
157	   script, including its directional properties, and the indication of
158	   vowels by the diacritical marking of consonantal base characters.
159	   This marking is obligatory, and both double vowels and syllable-final
160	   consonants are indicated by the marking of special unvoiced
161	   characters.  Every Dhivehi word therefore ends with a combining mark.

163	   The word for "computer", which is romanized as "konpeetaru", is
164	   written with the following sequence of Unicode code points:

166	      U+0786 THAANA LETTER KAAFU (AL)

168	      U+07AE THAANA OBOFILI (NSM)

170	      U+0782 THAANA LETTER NOONU (AL)

171	      U+07B0 THAANA SUKUN (NSM)

173	      U+0795 THAANA LETTER PAVIYANI (AL)

175	      U+07A9 THAANA EEBEEFILI (NSM)

177	      U+0793 THAANA LETTER TAVIYANI (AL)

179	      U+07A6 THAANA ABAFILI (NSM)

181	      U+0783 THAANA LETTER RAA (AL)

183	      U+07AA THAANA UBUFILI (NSM)

185	   The directionality class of U+07AA in the Unicode database is NSM
186	   (non-spacing mark), which is not R or AL; a conformant implementation
187	   of the IDNA2003 algorithm will say that "this is not in RandALCat",
188	   and refuse to encode the string.
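
   The rejection described above can be reproduced with a minimal
   Python sketch.  It covers only requirement 3 of RFC 3454 section 6,
   not the full Stringprep profile, and it assumes that the standard
   library function stringprep.in_table_d1() is an adequate test for
   RandALCat membership.  The names KONPEETARU, is_randalcat and
   passes_rule_3 are illustrative and do not come from any
   specification.

```python
import stringprep

# "konpeetaru" in network order, using the code points listed above.
KONPEETARU = "\u0786\u07ae\u0782\u07b0\u0795\u07a9\u0793\u07a6\u0783\u07aa"


def is_randalcat(ch: str) -> bool:
    """True if ch is in RFC 3454 table D.1 (bidi class R or AL)."""
    return stringprep.in_table_d1(ch)


def passes_rule_3(label: str) -> bool:
    """Check only requirement 3 of RFC 3454, section 6 (illustrative)."""
    if not any(is_randalcat(ch) for ch in label):
        return True  # the requirement applies only to RandALCat strings
    return is_randalcat(label[0]) and is_randalcat(label[-1])
```

   Under these assumptions, passes_rule_3(KONPEETARU) is false: the
   first character is in RandALCat, but the final combining mark is
   not.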

190	2.2.  Yiddish

192	   Yiddish is one of several languages written with the Hebrew script
193	   (others include Hebrew and Ladino).  This is basically a consonantal
194	   alphabet (also termed an "abjad") but Yiddish is written using an
195	   extended form that is fully vocalic.  The vowels are indicated in
196	   several ways, of which one is by repurposing letters that are
197	   consonants in Hebrew.  Other letters are used both as vowels and
198	   consonants, with combining marks, called "points", used to
199	   differentiate between them.  Finally, some base characters can
200	   indicate several different vowels, which are also disambiguated by
201	   combining marks.  Pointed characters can appear in word-final
202	   position and may therefore also be needed at the end of labels.  This
203	   is not an invariable attribute of a Yiddish string and there is thus
204	   greater latitude here than there is with Dhivehi.

206	   The organization now known as the "YIVO Institute for Jewish
207	   Research" developed orthographic rules for modern Standard Yiddish
208	   during the 1930s on the basis of work conducted in several venues
209	   since earlier in that century.  These are given in, "The Standardized
210	   Yiddish Orthography: Rules of Yiddish Spelling, 6th ed., YIVO
211	   Institute for Jewish Research, New York, 1999, ISBN 0-914512-25-0",
212	   ("SYO") and are taken as normatively descriptive of modern Standard
213	   Yiddish in any context where that notion is deemed relevant.  They
214	   have been applied exclusively in all Yiddish dictionaries published
215	   since their establishment, and are similarly dominant in academic and
216	   bibliographic regards.

218	   It therefore appears appropriate for this repertoire also to be
219	   supported fully by IDNA.  This presents no difficulty with characters
220	   in initial and medial positions, but pointed characters are regularly
221	   used in final position as well.  All of the characters in the SYO
222	   repertoire appear in both marked and unmarked form with one
223	   exception: the HEBREW LETTER PE (U+05E4).  The SYO only permits this
224	   with a HEBREW POINT DAGESH (U+05BC), providing the Yiddish equivalent
225	   to the Latin letter "p", or a HEBREW POINT RAFE (U+05BF), equivalent
226	   to the Latin letter "f".  There is, however, a separate unpointed
227	   allograph, the HEBREW LETTER FINAL PE (U+05E3), for the latter
228	   character when it appears in final position.  The constraint on the
229	   use of the SYO repertoire resulting from the proscription of
230	   combining marks at the end of RTL strings thus reduces to nothing
231	   more, or less, than the equivalent of saying that a string of Latin
232	   characters cannot end with the letter "p".  It must also be noted
233	   that the HEBREW LETTER PE with HEBREW POINT DAGESH is characteristic
234	   of almost all traditional Yiddish orthographies that predate (or
235	   remain in use in parallel to) the SYO, being the first pointed
236	   character to appear in any of them.

238	   A more general instantiation of the basic problem can be seen in the
239	   representation of the YIVO acronym.  This is written with the Hebrew
240	   letters YOD YOD HIRIQ VAV VAV ALEF QAMATS, where HIRIQ and QAMATS are
241	   combining points:

243	      U+05D9 HEBREW LETTER YOD (R)

245	      U+05B4 HEBREW POINT HIRIQ (NSM)

247	      U+05D5 HEBREW LETTER VAV (R)

249	      U+05D0 HEBREW LETTER ALEF (R)

251	      U+05B8 HEBREW POINT QAMATS (NSM)

253	   The directionality class of U+05B8 HEBREW POINT QAMATS in the Unicode
254	   database is NSM, which again causes the IDNA2003 algorithm to reject
255	   the string.

257	   It may also be noted that all of the combined characters mentioned
258	   above exist in precomposed form at separate positions in the Unicode
259	   chart.  However, by invoking Stringprep, the IDNA2003 algorithm also
260	   rejects those codepoints, for reasons not discussed here.

262	2.3.  Strings with numbers

264	   RFC 3454, in its insistence that the first and last characters of a
265	   string be of category R or AL, prohibited strings that contain
266	   right-to-left characters and end with a number.

268	   Consider the strings ALEF 5 (HEBREW LETTER ALEF + DIGIT FIVE) and 5
269	   ALEF.  Displayed in an LTR context, the first one will be displayed
270	   from left to right as 5 ALEF (with the 5 being considered right-to-
271	   left because of the leading ALEF), while 5 ALEF will be displayed in
272	   exactly the same order (5 taking the direction from context).
273	   Clearly, only one of those should be permitted as a registered label.

275	3.  An expanded justification for the bidi rule

277	   One issue with RFC 3454 was that it did not give an explicit
278	   justification for the bidi rule, so it was hard to tell whether a
279	   modified rule would continue to fulfil the purpose for which the RFC
280	   3454 rule was written.

282	   This document proposes an explicit justification, by stating a set of
283	   requirements for which it is possible to test whether or not the
284	   modified rule fulfils the requirement.

286	   All the text in this document assumes that text containing the labels
287	   under consideration will be displayed using the Unicode bidirectional
288	   algorithm [UAX9].

290	   The justification proposed is this:

292	   o  No two labels, when presented in display order, should have the
293	      same sequence of characters without also having the same sequence
294	      of characters in network order.  (This is the criterion that is
295	      explicit in RFC 3454).

297	   o  In a display of a string of labels, the characters of each label
298	      should remain grouped between the characters delimiting the
299	      labels.

301	   o  These properties should hold true both when the string is embedded
302	      in a paragraph with LTR direction and when it's embedded in a
303	      paragraph with RTL direction, as long as explicit directional
304	      controls are not used within the same paragraph.

306	   Several stronger statements were considered and rejected, because
307	   they seem to be impossible to fulfil within the constraints of the
308	   Unicode bidirectional algorithm.  These include:

310	   o  The appearance of a label should be unaffected by its embedding
311	      context.  This proved impossible even for ASCII labels; the label
312	      "123-456" will have a different display order in an RTL context
313	      than in an LTR context.

315	   o  The sequence of labels should be consistent with network order.
316	      This proved impossible - a domain name consisting of the labels
317	      (in network order) L1.R1.R2.L2 will be displayed as L1.R2.R1.L2 in
318	      an LTR context.

320	   o  The "remain grouped" property should remain true when directional
321	      controls (LRE, RLE, RLO, LRO, PDF) are used in the same paragraph
322	      (outside of the labels).  Because these controls affect
323	      presentation order in non-obvious ways, by affecting the "sor" and
324	      "eor" properties of the Unicode BIDI algorithm, the conditions
325	      above would be very hard to satisfy for a useful set of strings
326	      if this were required.  As long as these controls have no influence
327	      over the display of the domain name, no problem will be caused,
328	      but the exact criterion for "will not influence" is hard to
329	      codify.

331	   o  The "no two labels display the same" property should hold between
332	      LTR paragraphs and RTL paragraphs.  This was shown to be unsound.

334	   o  No two domain names should be displayed the same, even under
335	      differing directionality.  This was shown to be unsound, since the
336	      domain name (network) ABC.abc will have display order CBA.abc in
337	      an LTR context and abc.CBA in an RTL context, while the domain
338	      name (network) abc.ABC will display as abc.CBA in an LTR context
339	      and as CBA.abc in an RTL context.
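
   The display orders quoted in the examples above can be checked with
   a toy Python model.  This is a sketch under strong assumptions, not
   an implementation of [UAX9]: it follows the convention of section
   1.3 (uppercase letters are class R, lowercase letters are class L),
   treats every other character as a neutral, rejects digits, and
   ignores explicit directional controls.  The function name
   display_order is hypothetical.

```python
def display_order(text: str, paragraph_rtl: bool = False) -> str:
    """Toy bidi reordering for strings of L, R and neutral characters."""
    base_dir = "R" if paragraph_rtl else "L"

    def bidi_class(ch: str) -> str:
        if ch.isdigit():
            raise ValueError("digits are outside this toy model")
        if ch.isalpha():
            return "R" if ch.isupper() else "L"
        return "N"  # everything else is treated as a neutral

    classes = [bidi_class(ch) for ch in text]
    n = len(classes)

    # Resolve each run of neutrals: it takes the direction of its
    # neighbours if they agree, otherwise the paragraph direction.
    resolved = classes[:]
    i = 0
    while i < n:
        if classes[i] != "N":
            i += 1
            continue
        j = i
        while j < n and classes[j] == "N":
            j += 1
        before = classes[i - 1] if i > 0 else base_dir
        after = classes[j] if j < n else base_dir
        direction = before if before == after else base_dir
        for k in range(i, j):
            resolved[k] = direction
        i = j

    # Assign embedding levels: even levels are LTR, odd levels are RTL.
    if paragraph_rtl:
        levels = [1 if d == "R" else 2 for d in resolved]
    else:
        levels = [0 if d == "L" else 1 for d in resolved]

    # Reverse every run at or above each level, from the highest level
    # down to level 1.
    chars = list(text)
    for level in range(max(levels, default=0), 0, -1):
        i = 0
        while i < n:
            if levels[i] < level:
                i += 1
                continue
            j = i
            while j < n and levels[j] >= level:
                j += 1
            chars[i:j] = reversed(chars[i:j])
            i = j
    return "".join(chars)
```

   Within this model, display_order("ABC.abc") and
   display_order("abc.ABC", paragraph_rtl=True) both produce "CBA.abc",
   which is the collision described in the last bullet above.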

341	   For reference, here are the values that the Unicode BIDI property can
342	   have:

344	   o  L - Left-to-right - most letters in LTR scripts

346	   o  R - Right-to-left - most letters in non-Arabic RTL scripts

348	   o  AL - Arabic letters - most letters in the Arabic script

350	   o  EN - European Number (0-9)

352	   o  ES - European Number Separator (+ and -)

354	   o  ET - European Number Terminator (currency symbols, the hash sign,
355	      the percent sign and so on)

357	   o  AN - Arabic Number

359	   o  CS - Common Number Separator (. , / : et al)

361	   o  NSM - Nonspacing Mark - most combining accents

362	   o  BN - Boundary Neutral - control characters

364	   o  B - Paragraph Separator

366	   o  S - Segment Separator

368	   o  WS - Whitespace, including the SPACE character

370	   o  ON - Other Neutrals, including @, &, parentheses, MIDDLE DOT

372	   o  LRE, LRO, RLE, RLO, PDF - these are "directional control
373	      characters", and are not used in IDNA labels.

375	   The "remain grouped" property can be more formally stated as:

377	   o  Let "Delimiter chars" be a set of characters with the Unicode BIDI
378	      properties CS, WS, ON.  (These are commonly used to delimit labels
379	      - both the FULL STOP and the space are included.)

381	      *  ET, though it commonly occurs next to domain names in practice,
382	         is problematic: the context R CS L EN ET (for instance A.a1%)
383	         makes the label L EN unstable.

385	      *  ES commonly occurs in labels as HYPHEN-MINUS, but could also be
386	         used as a delimiter (for instance, the plus sign).  It is left
387	         out here.

389	   o  Let "Position" be the position of a character in a string (in
390	      network order)

392	   o  Let "Bidi position" be the position computed by the Unicode Bidi
393	      algorithm

395	   In a paragraph with an embedded string formed from the substrings A B
396	   L C D, where A and D are (possibly zero-length) legal labels, and B
397	   and C are single "Delimiter chars", the label L is a legal label if,
398	   for all A, B, C and D, the bidi position of all characters in L is
399	   within the range of positions for the characters of L in the string,
400	   for both the LTR and RTL paragraph direction.

402	   (The "zero-length" case represents the case where a domain name is
403	   next to something that isn't a domain name, separated by a delimiter
404	   character).

406	   The "No two labels" property can be formally stated as:

408	   If two labels L and L', embedded as for the test above, displayed in
409	   a paragraph with the same directionality, are rearranged into the
410	   same sequence of codepoints, neither L nor L' is a legal label.

412	4.  A replacement for the RFC 3454 criterion

414	   A set of rules that satisfies the tests above is as follows.  The
415	   main bullets give the rules; subordinate bullets (if any) give
416	   justifications or examples of things that break if the rule is not
417	   present.  The term "unstable" means that it fails to satisfy the
418	   "remain grouped" property defined above.

420	   Exhaustive testing has verified that strings that satisfy this
421	   criterion satisfy both the requirements above at least for all
422	   strings up to 6 characters.

424	   o  Only characters with the BIDI properties L, R, AL, AN, EN, ES, BN,
425	      ON and NSM are allowed.

427	      *  B, S and WS are excluded because they are separators or spaces.

429	      *  LRE, LRO, RLE, RLO, PDF are excluded because they are bidi
430	         controls.

432	      *  ET is excluded because the string L ET is unstable.

434	      *  CS is excluded because the string L CS is unstable.

436	   o  ES and ON are not allowed in the first position.

438	      *  ES R and ON R are both unstable.

440	   o  ES or ON, followed by zero or more NSM, is not allowed in the
441	      last position.

443	      *  L ON and L ES are both unstable.

445	   o  If an L is present, no R, AL or AN may be present, and vice versa.

447	   o  If an EN is present, no AN may be present, and vice versa.

449	   o  The first character may not be an NSM.

451	   o  The first character may not be an EN (European Number) or an AN
452	      (Arabic Number).

454	      *  If the character on both sides of a CS is an EN or an AN, the
455	         labels turn unstable.

457	      *  Some domain names where some of the labels use leading EN and
458	         AN may be problem-free, but there's no way of verifying this
459	         while looking at a single label in isolation.

461	      *  NOTE: This is a restriction on ASCII labels when used together
462	         with IDNA labels.  This is a change from the existing rules for
463	         ASCII labels.

465	      *  We could achieve stability by barring numbers at the end of
466	         labels, but this may be more disruptive in practice.
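
   The main bullets above can be sketched as a per-label check in
   Python.  This is an illustration, not a reference implementation:
   it assumes that unicodedata.bidirectional() from the standard
   library (which reflects whatever Unicode version the interpreter
   ships) is an acceptable source of Bidi classes, and it applies the
   leading-digit rule to every label, without modelling the condition
   that the rule matters for ASCII labels only when they are used
   together with right-to-left labels.  The name bidi_rule_problems is
   hypothetical.

```python
import unicodedata
from typing import List

ALLOWED = {"L", "R", "AL", "AN", "EN", "ES", "BN", "ON", "NSM"}


def bidi_rule_problems(label: str) -> List[str]:
    """Return descriptions of the rules in this section that label breaks."""
    classes = [unicodedata.bidirectional(ch) for ch in label]
    if not classes:
        return ["empty label"]
    present = set(classes)
    problems = []

    if not present <= ALLOWED:
        problems.append("disallowed Bidi class present")
    if classes[0] in ("ES", "ON"):
        problems.append("ES or ON in first position")

    # Skip trailing NSMs to find the character they are attached to.
    tail = list(classes)
    while tail and tail[-1] == "NSM":
        tail.pop()
    if tail and tail[-1] in ("ES", "ON"):
        problems.append("ES or ON (plus NSMs) in last position")

    if "L" in present and present & {"R", "AL", "AN"}:
        problems.append("L mixed with R, AL or AN")
    if "EN" in present and "AN" in present:
        problems.append("EN mixed with AN")
    if classes[0] == "NSM":
        problems.append("NSM in first position")
    if classes[0] in ("EN", "AN"):
        problems.append("EN or AN in first position")
    return problems
```

   With this sketch, the Dhivehi and Yiddish words from section 2 yield
   an empty list, and of the two strings in section 2.3 only ALEF 5 is
   accepted; 5 ALEF is reported because of its leading digit.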

468	5.  Other issues in need of resolution

470	   This document concerns itself only with the rules that are needed
471	   when dealing with domain names with characters that have differing
472	   Bidi properties, and considers characters only in terms of their Bidi
473	   properties.  All other issues with these scripts have to be
474	   considered in other contexts.

476	   Another set of issues concerns the proper display of IDNs with a
477	   mixture of LTR and RTL labels, or only RTL labels.

479	   It is unrealistic to expect that domain names will be written using
480	   embedded formatting codes between their labels; thus, the display
481	   order will be determined by the bidirectional algorithm.  As a
482	   result, a sequence (in network order) of R1.R2.ltr will be displayed
483	   in the order 2R.1R.ltr in an LTR context, which might surprise someone
484	   expecting to see labels displayed in hierarchical order.  Again, this
485	   memo does not attempt to suggest a solution to this problem.

487	6.  Compatibility considerations

489	6.1.  Backwards compatibility considerations

491	   As with any change to an existing standard, it is important to
492	   consider what happens with existing implementations when the change
493	   is introduced.  The following troublesome cases have been noted:

495	   o  Old program used to input the newly allowed string.  If the old
496	      program checks the input against RFC 3454, the string will not be
497	      allowed, and that domain name will remain inaccessible.

499	   o  Old program is asked to display the newly allowed string, and
500	      checks it against RFC 3454 before displaying.  The program will
501	      perform some kind of fallback, most likely displaying the Punycode
502	      form of the string.

504	   o  Old program tries to display the newly allowed string.  If the old
505	      program has code for displaying the last character of a string
506	      that is different from the code used to display the characters in
507	      the middle of the string, display may be inconsistent and cause
508	      confusion.

510	   One particular example of the last case is if a program chooses to
511	   examine the last character (in network order) of a string in order to
512	   determine its directionality, rather than its first; if it finds an
513	   NSM character and tries to display the string as if it were a left-to-
514	   right string, the resulting display may be interesting, but not
515	   useful.

517	   The editors believe that these cases will have less harmful impact in
518	   practice than continuing to deny the use of words from the languages
519	   for which these strings are necessary as IDN labels.

521	   This specification forbids using leading European numbers in ASCII-
522	   only labels; this is in conflict with a large installed base of such
523	   labels.  The harm resulting from violating this rule is seen when a
524	   label at the next level down in the hierarchy ends with a number
525	   (Arabic or European).  Zone managers, both registries and private
526	   zone managers, can check for this particular condition before they
527	   allow registration of any string with right-to-left characters in it;
528	   generally it is best to not allow registration of any right-to-left
529	   strings in a zone where the label at the level above begins with a
530	   digit.
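
   The check suggested for zone managers might look like the following
   Python sketch.  It assumes that "the label at the level above" means
   the leftmost label of the zone name, that a label counts as right-
   to-left if it contains any character of class R, AL or AN, and that
   unicodedata.bidirectional() is an acceptable source of Bidi classes.
   The function name may_register and the example zone names are
   hypothetical.

```python
import unicodedata

RTL_CLASSES = {"R", "AL", "AN"}
NUMBER_CLASSES = {"EN", "AN"}


def may_register(label: str, zone: str) -> bool:
    """Refuse right-to-left labels in a zone whose own label starts with a digit."""
    has_rtl = any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in label)
    parent = zone.split(".")[0]  # label directly above the new label
    parent_leads_with_number = (
        parent != "" and unicodedata.bidirectional(parent[0]) in NUMBER_CLASSES
    )
    return not (has_rtl and parent_leads_with_number)
```

   For instance, may_register("\u05d0\u05d1", "3com.example") is false
   under these assumptions, while the same label is accepted in
   "museum.example".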

532	6.2.  Forward compatibility considerations

534	   This text is, intentionally, specified strictly in terms of the
535	   Unicode BIDI properties.  The determination that the condition is
536	   sufficient to fulfil the criteria depends on the Unicode BIDI
537	   algorithm; it is unlikely that drastic changes will be made to this
538	   algorithm.

540	   However, the determination of validity for any string depends on the
541	   Unicode BIDI property values, which are not declared immutable by the
542	   Unicode Consortium.  Furthermore, the behaviour of the algorithm for
543	   any given character is likely to be linguistically and culturally
544	   sensitive, so that it's not unlikely that later versions of the
545	   Unicode standard may change the bidi properties assigned to certain
546	   Unicode characters.

548	   This memo does not propose a solution for this problem.

550	7.  IANA Considerations

552	   This document makes no request of IANA.

554	   Note to RFC Editor: this section may be removed on publication as an
555	   RFC.

557	8.  Security Considerations

559	   This modification will allow some strings to be used in Stringprep
560	   contexts that are not allowed today.  It is possible that differences
561	   in the interpretation of the specification between old and new
562	   implementations could pose a security risk, but it is difficult to
563	   envision any specific instantiation of this.

565	   Any rational attempt to compute, for instance, a hash over an
566	   identifier processed by Stringprep would use network order for its
567	   computation, and thus be unaffected by the changes proposed here.

569	   While it is not believed to pose a problem, if display routines had
570	   been written with specific knowledge of the RFC 3454 Stringprep
571	   prohibitions, it is possible that the potential problems noted under
572	   "backwards compatibility" could cause new kinds of confusion.

574	   The rule about leading numbers, which is more restrictive than
575	   current practice for domain names, has a peculiar interaction with
576	   the DNAME record; a DNAME record can point to a zone where right-to-
577	   left labels are registered without the knowledge or consent of the
578	   zone owner; if the name of the DNAME begins with a number, this can
579	   cause display of the right-to-left labels in the zone to be
580	   confusing.  It is recommended that DNAMEs pointing to zones allowing
581	   right-to-left labels should not start with a digit, but a pointed-to
582	   zone owner has no way of enforcing this.

584	9.  Acknowledgements

586	   While the listed editors held the pen, this document represents the
587	   joint work and conclusions of an ad hoc design team.  In addition to
588	   the editors this consisted of, in alphabetic order, Tina Dam, Patrik
589	   Faltstrom, and John Klensin.  Many further specific contributions and
590	   helpful comments were received from the people listed below, and
591	   others who have contributed to the development and use of the IDNA
592	   protocols.

594	   The team wishes in particular to thank Roozbeh Pournader for calling
595	   its attention to the issue with the Thaana script, Paul Hoffmann for
596	   pointing out the need to be explicit about backwards compatibility
597	   considerations, Ken Whistler for suggesting the basis of the
598	   formalized "remain grouped" requirement, and Erik van der Poel for
599	   careful review, comments and verification of the rulesets.

601	Appendix A.  Change log

603	   This appendix is intended to be removed when this document is
604	   published as an RFC.

606	A.1.  Changes from -00 to -01

608	   Suggested a possible new algorithm.

610	   Multiple smaller changes.

612	A.2.  Changes from -01 to -02

614	   Date of publication updated.

616	   Change log added.

618	A.3.  Changes from -02 to -03

620	   Intro changed to reflect addressing the deeper issues with the Bidi
621	   algorithm.

623	   Gave formalized criteria for "valid strings", and documented the new
624	   set of requirements for strings that satisfy the criteria.

626	   Removed most of section 5, "Other problems", and noted that this memo
627	   focuses ONLY on issues that can be evaluated by looking at the bidi
628	   properties of characters.

630	A.4.  Changes from -03 to -04

632	   Added back AN to the list of allowed characters; it had been left out
633	   by accident in -03.

635	   Removed some rules that were redundant.

637	   Added some considerations for backwards compatibility and interaction
638	   with ASCII labels that start with a number.

640	   Mentioned the issue with DNAME pointing to a zone containing RTL
641	   labels in the security considerations section.

643	   Wording updates in multiple places, including spelling corrections.

645	   Rewrote the introduction section.

647	   Split references into "normative" and "informative".

649	10.  References

651	10.1.  Normative references

653	   [I-D.klensin-idnabis-issues]
654	              Klensin, J., "Internationalizing Domain Names for
655	              Applications (IDNA): Issues, Explanation, and Rationale",
656	              draft-klensin-idnabis-issues-07 (work in progress),
657	              February 2008.

659	   [UAX9]     Davis, M., "Unicode Standard Annex #9: The Bidirectional
660	              Algorithm, revision 15", March 2005.

662	10.2.  Informative references

664	   [RFC3454]  Hoffman, P. and M. Blanchet, "Preparation of
665	              Internationalized Strings ("stringprep")", RFC 3454,
666	              December 2002.

668	Authors' Addresses

670	   Harald Tveit Alvestrand (editor)
671	   Google
672	   Beddingen 10
673	   Trondheim,   7014
674	   Norway

676	   Email: harald@alvestrand.no

677	   Cary Karp (editor)
678	   Swedish Museum of Natural History
679	   Frescativ. 40
680	   Stockholm,   10405
681	   Sweden

683	   Phone: +46 8 5195 4055
685	   Email: ck@nrm.museum

688	Full Copyright Statement

690	   Copyright (C) The IETF Trust (2008).

692	   This document is subject to the rights, licenses and restrictions
693	   contained in BCP 78, and except as set forth therein, the authors
694	   retain all their rights.

696	   This document and the information contained herein are provided on an
697	   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
698	   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
699	   THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
700	   OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
701	   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
702	   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

704	Intellectual Property

706	   The IETF takes no position regarding the validity or scope of any
707	   Intellectual Property Rights or other rights that might be claimed to
708	   pertain to the implementation or use of the technology described in
709	   this document or the extent to which any license under such rights
710	   might or might not be available; nor does it represent that it has
711	   made any independent effort to identify any such rights.  Information
712	   on the procedures with respect to rights in RFC documents can be
713	   found in BCP 78 and BCP 79.

715	   Copies of IPR disclosures made to the IETF Secretariat and any
716	   assurances of licenses to be made available, or the result of an
717	   attempt made to obtain a general license or permission for the use of
718	   such proprietary rights by implementers or users of this
719	   specification can be obtained from the IETF on-line IPR repository at
720	   http://www.ietf.org/ipr.

722	   The IETF invites any interested party to bring to its attention any
723	   copyrights, patents or patent applications, or other proprietary
724	   rights that may cover technology that may be required to implement
725	   this standard.  Please address the information to the IETF at
726	   ietf-ipr@ietf.org.

728	Acknowledgment

730	   Funding for the RFC Editor function is provided by the IETF
731	   Administrative Support Activity (IASA).