idnits 2.17.1 

draft-alvestrand-idna-bidi-02.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** It looks like you're using RFC 3978 boilerplate.  You should update this
     to the boilerplate described in the IETF Trust License Policy document
     (see https://trustee.ietf.org/license-info), which is required now.

  -- Found old boilerplate from RFC 3978, Section 5.1 on line 16.

  -- Found old boilerplate from RFC 3978, Section 5.5, updated by RFC 4748 on
     line 501.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 512.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 519.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 525.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack separate sections for Informative/Normative
     References.  All references will be assumed normative when checking for
     downward references.

  == There are 1 instance of lines with non-RFC2606-compliant FQDNs in the
     document.

  ** The document seems to lack a both a reference to RFC 2119 and the
     recommended RFC 2119 boilerplate, even if it appears to use RFC 2119
     keywords. 

     RFC 2119 keyword, line 77: '...      character MUST be the first char...'
     RFC 2119 keyword, line 78: '...dALCat character MUST be the last char...'
     RFC 2119 keyword, line 313: '...owing conditions MUST be true in both ...'


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust Copyright Line does not match the
     current year

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (Jan 9, 2008) is 5951 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  ** Obsolete normative reference: RFC 3454 (Obsoleted by RFC 7564)

  -- Possible downref: Non-RFC (?) normative reference: ref. 'UAX9'


     Summary: 4 errors (**), 0 flaws (~~), 2 warnings (==), 8 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Network Working Group                                 H. Alvestrand, Ed.
3	Internet-Draft                                                    Google
4	Intended status: Standards Track                            C. Karp, Ed.
5	Expires: July 12, 2008                 Swedish Museum of Natural History
6	                                                             Jan 9, 2008

8	                An IDNA problem in right-to-left scripts
9	                     draft-alvestrand-idna-bidi-02

11	Status of this Memo

13	   By submitting this Internet-Draft, each author represents that any
14	   applicable patent or other IPR claims of which he or she is aware
15	   have been or will be disclosed, and any of which he or she becomes
16	   aware will be disclosed, in accordance with Section 6 of BCP 79.

18	   Internet-Drafts are working documents of the Internet Engineering
19	   Task Force (IETF), its areas, and its working groups.  Note that
20	   other groups may also distribute working documents as Internet-
21	   Drafts.

23	   Internet-Drafts are draft documents valid for a maximum of six months
24	   and may be updated, replaced, or obsoleted by other documents at any
25	   time.  It is inappropriate to use Internet-Drafts as reference
26	   material or to cite them other than as "work in progress."

28	   The list of current Internet-Drafts can be accessed at
29	   http://www.ietf.org/ietf/1id-abstracts.txt.

31	   The list of Internet-Draft Shadow Directories can be accessed at
32	   http://www.ietf.org/shadow.html.

34	   This Internet-Draft will expire on July 12, 2008.

36	Copyright Notice

38	   Copyright (C) The IETF Trust (2008).

40	Abstract

42	   The use of right-to-left scripts in internationalized domain names
43	   has presented several challenges.  This memo discusses some problems
44	   with these scripts, including one resulting from a constraint on the
45	   use of combining characters at the end of an RTL domain label,
46	   causing some words to be declared invalid as IDN labels, and proposes
47	   a means for ameliorating this problem.

49	Table of Contents

51	   1.  Introduction and problem description
52	   2.  Detailed examples
53	     2.1.  Dhivehi
54	     2.2.  Yiddish
55	     2.3.  Strings with numbers
56	   3.  An expanded justification for the bidi rule
57	   4.  Modification to RFC 3454
58	     4.1.  Alternative approach
59	   5.  Other issues in need of resolution
60	   6.  Backwards compatibility considerations
61	   7.  IANA Considerations
62	   8.  Security Considerations
63	   9.  Acknowledgements
64	   Appendix A.  Change log
65	     A.1.  Changes from -00 to -01
66	     A.2.  Changes from -01 to -02
67	   10. References
68	   Authors' Addresses
69	   Intellectual Property and Copyright Statements

71	1.  Introduction and problem description

73	   The IDNA specification "Stringprep" [RFC3454] makes the following
74	   statement in its section 6 on the bidi algorithm:

76	      3) If a string contains any RandALCat character, a RandALCat
77	      character MUST be the first character of the string, and a
78	      RandALCat character MUST be the last character of the string.

80	   (A RandALCat character is a character with unambiguously right-to-
81	   left directionality.)

83	   The reasoning behind this prohibition was to ensure that every
84	   component of a visually presented domain name has an unambiguously
85	   preferred direction.  However, this makes certain words in languages
86	   written with right-to-left scripts invalid as IDN labels, and in at
87	   least one case means that all the words of an entire language are
88	   forbidden as IDN labels.

90	   This will be illustrated below with examples taken from the Dhivehi
91	   and Yiddish languages, as written with the Thaana and Hebrew scripts,
92	   respectively.

94	   The problem may be addressed by more carefully considering the bidi
95	   algorithm in Unicode Standard Annex #9 [UAX9] which states in section
96	   3.3.3 W1: "Examine each non-spacing mark (NSM) in the level run, and
97	   change the type of the NSM to the type of the previous character."
98	   (See below for some terminology).

100	   Section 3 of UAX9 contains several instructions for determining the
101	   directionality of the characters in a string.  Some of them (for
102	   instance those using explicit embedding) are irrelevant to IDNA
103	   because the corresponding codes are not permitted as IDNA input, so a
104	   slightly simplified version should be enough for IDNA purposes.

106	   A note on terminology:

108	   In this memo, we use "network order" to describe the sequence of
109	   characters as transmitted on the wire or stored in a file; the terms
110	   "first", "next" and "previous" are used to refer to the relationship
111	   of characters in network order.

113	   We use "display order" to talk about the sequence of characters as
114	   imaged on a display medium; the terms "left" and "right" are used to
115	   refer to the relationship of characters in display order.

117	2.  Detailed examples

119	2.1.  Dhivehi

121	   Dhivehi, the official language of the Maldives, is written with the
122	   Thaana script.  This displays some of the characteristics of Arabic
123	   script, including its directional properties, and the indication of
124	   vowels by the diacritical marking of consonantal base characters.
125	   This marking is obligatory, and both double vowels and syllable-final
126	   consonants are indicated by the marking of special unvoiced
127	   characters.  Every Dhivehi word therefore ends with a combining mark.

129	   The word for "computer", which is romanized as "konpeetaru", is
130	   written with the following sequence of Unicode code points:

132	      U+0786 THAANA LETTER KAAFU (AL)

134	      U+07AE THAANA OBOFILI (NSM)

136	      U+0782 THAANA LETTER NOONU (AL)

138	      U+07B0 THAANA SUKUN (NSM)

140	      U+0795 THAANA LETTER PAVIYANI (AL)

142	      U+07A9 THAANA EEBEEFILI (NSM)

144	      U+0793 THAANA LETTER TAVIYANI (AL)

146	      U+07A6 THAANA ABAFILI (NSM)

148	      U+0783 THAANA LETTER RAA (AL)

150	      U+07AA THAANA UBUFILI (NSM)

152	   The directionality class of U+07AA in the Unicode database is NSM
153	   (non-spacing mark), which is not R or AL; a conformant implementation
154	   of the IDNA algorithm will say that "this is not in RandALCat", and
155	   refuse to encode the string.
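
   The rejection described above can be reproduced with a short sketch.
   It assumes Python's standard "stringprep" module, whose
   in_table_d1() function tests membership in RFC 3454 table D.1
   (categories R and AL, Unicode 3.2 data).  The function name
   violates_rule_3 is hypothetical, and only requirement 3 of RFC 3454
   section 6 is modelled; the other requirements are not checked.

```python
import stringprep


def violates_rule_3(label: str) -> bool:
    """Return True if `label` breaks requirement 3 of RFC 3454 section 6."""
    # Table D.1 lists the characters that RFC 3454 calls "RandALCat".
    randal = [stringprep.in_table_d1(ch) for ch in label]
    if not any(randal):
        # The requirement only applies when a RandALCat character is present.
        return False
    # Both the first and the last character must be RandALCat.
    return not (randal[0] and randal[-1])


# "konpeetaru" in network order, using the code points listed above.
KONPEETARU = "\u0786\u07AE\u0782\u07B0\u0795\u07A9\u0793\u07A6\u0783\u07AA"

if __name__ == "__main__":
    print(violates_rule_3(KONPEETARU))
```

   The label fails the check because its last character, U+07AA, is in
   class NSM rather than R or AL.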

157	2.2.  Yiddish

159	   Yiddish is one of several languages written with the Hebrew script
160	   (others include Hebrew and Ladino).  This is basically a consonantal
161	   alphabet but Yiddish is written using an extended form that is fully
162	   vocalic.  The vowels are indicated in several ways, of which one is
163	   by repurposing letters that are consonants in Hebrew.  Other letters
164	   are used both as vowels and consonants, with combining marks used to
165	   differentiate between them.  Finally, some base characters can
166	   indicate several different vowels, which are also disambiguated by
167	   combining marks.  Marked characters can appear in word-final position
168	   and may therefore also be needed at the end of labels.  This is not
169	   an invariable attribute of all Yiddish strings and there is thus
170	   greater latitude here than there is with Dhivehi.

172	   The "YIVO Institute for Jewish Research" is widely known by the
173	   acronym of its Yiddish name.  This organization maintains a primary
174	   reference standard for modern Standard Yiddish orthography, that is
175	   also commonly referred to by the same acronym (as the "YIVO Rules").
176	   YIVO is written with the Hebrew letters YOD YOD HIRIQ VAV VAV ALEF
177	   QAMATS, where HIRIQ and QAMATS are combining "points":

179	      U+05D9 HEBREW LETTER YOD (R)

181	      U+05B4 HEBREW POINT HIRIQ (NSM)

183	      U+05D5 HEBREW LETTER VAV (R)

185	      U+05D0 HEBREW LETTER ALEF (R)

187	      U+05B8 HEBREW POINT QAMATS (NSM)

189	   The directionality class of U+05B8 HEBREW POINT QAMATS in the Unicode
190	   database is NSM, which again causes the IDNA algorithm to reject the
191	   string.  (It may also be noted that the requisite combined characters
192	   also exist in precomposed form at separate positions in the Unicode
193	   chart.  However, Stringprep also rejects those codepoints, for
194	   reasons not discussed here.)

196	2.3.  Strings with numbers

198	   RFC 3454, in its insistence that the first and last characters of a
199	   string be category R or AL, prohibited strings that contained right-
200	   to-left characters and began or ended with a number.

202	   Considering the string ALEF 5 (HEBREW LETTER ALEF + DIGIT FIVE), if
203	   we specify that UAX#9 is used to find the directionality of
204	   characters, this string will have a consistent direction (R).
205	   However, the string 5 ALEF, when embedded in an LTR context, will
206	   have the same display order, with a different direction assigned to
207	   the number 5.  These two display strings are confusable, so we need a
208	   rule that permits only one of these in a domain name label.
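
   The confusable pair can be illustrated with a deliberately reduced
   sketch of the UAX#9 rules.  It is not a conformant implementation:
   it assumes a single level run with no explicit embedding codes,
   accepts only characters of class L, R, AL, EN and NSM, and applies
   only rules W1, W2, W3, W7, I1, I2 and L2.  The function name
   display_order and its base_level parameter are hypothetical.

```python
import unicodedata

SUPPORTED = {"L", "R", "AL", "EN", "NSM"}


def display_order(text: str, base_level: int) -> str:
    """Reorder `text` for display; base_level 0 is LTR context, 1 is RTL."""
    if base_level not in (0, 1):
        raise ValueError("base_level must be 0 or 1")
    types = [unicodedata.bidirectional(ch) for ch in text]
    if not set(types) <= SUPPORTED:
        raise ValueError("this sketch handles only L, R, AL, EN and NSM")
    sor = "L" if base_level == 0 else "R"

    # W1: a non-spacing mark takes the type of the previous character (or sor).
    prev = sor
    for i, t in enumerate(types):
        if t == "NSM":
            types[i] = prev
        prev = types[i]
    # W2: a European number becomes AN if the last strong type before it is AL.
    strong = sor
    for i, t in enumerate(types):
        if t in ("L", "R", "AL"):
            strong = t
        elif t == "EN" and strong == "AL":
            types[i] = "AN"
    # W3: AL becomes R.
    types = ["R" if t == "AL" else t for t in types]
    # W7: a European number becomes L if the last strong type before it is L.
    strong = sor
    for i, t in enumerate(types):
        if t in ("L", "R"):
            strong = t
        elif t == "EN" and strong == "L":
            types[i] = "L"

    # I1/I2: derive an embedding level for each character.
    if base_level == 0:
        bump = {"L": 0, "R": 1, "EN": 2, "AN": 2}
    else:
        bump = {"R": 0, "L": 1, "EN": 1, "AN": 1}
    items = [(ch, base_level + bump[t]) for ch, t in zip(text, types)]

    # L2: from the highest level down to 1, reverse runs at that level or above.
    for level in range(max((lv for _, lv in items), default=0), 0, -1):
        i = 0
        while i < len(items):
            j = i
            while j < len(items) and items[j][1] >= level:
                j += 1
            if j > i:
                items[i:j] = items[i:j][::-1]
                i = j
            else:
                i += 1
    return "".join(ch for ch, _ in items)
```

   Under these assumptions, display_order("\u05D05", 0) and
   display_order("5\u05D0", 0) return the same left-to-right sequence,
   while the two inputs differ in network order.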

210	3.  An expanded justification for the bidi rule

212	   One issue with RFC 3454 was that it did not give an explicit
213	   justification for the bidi rule, so it was hard to tell whether a
214	   modified rule would continue to fulfil the purpose for which the RFC
215	   3454 rule was written.

217	   This document proposes an explicit justification, for which we think
218	   it is possible to test whether or not the modified rule fulfils the
219	   justification.

221	   The justification proposed is this:

223	   o  No two labels, when presented in visual order, should have the
224	      same sequence of characters without also having the same sequence
225	      of characters in network order.  (This is the criterion that is
226	      explicit in RFC 3454).

228	   o  In a visual presentation of a string of labels, the characters of
229	      each label should remain grouped between the dots delimiting the
230	      label components.

232	   o  This property should hold true both when the string is embedded in
233	      an RTL context and when it's embedded in an LTR context.

235	   o  This property should hold true without adding extra formatting,
236	      for example bidi control characters, to the string.

238	   Several stronger statements were considered and rejected, because
239	   they seem to be impossible to fulfil within the constraints of the
240	   Unicode bidirectional algorithm.  These include:

242	   o  The appearance of a label should be unaffected by its embedding
243	      context.  This proved impossible even for ASCII labels; the label
244	      "123-456" will have a different display order in an RTL context
245	      than in an LTR context.

247	   o  The sequence of labels should be consistent with network order.
248	      This proved impossible - a domain name consisting of the labels
249	      (in network order) L1.R1.R2.L2 will be displayed as L1.R2.R1.L2 in
250	      an LTR context.

252	4.  Modification to RFC 3454

254	   If the following modification is made to RFC 3454 section 6,
255	   paragraph 4, we believe that the usefulness of the specification for
256	   languages written with right-to-left scripts will be significantly
257	   improved:

259	   Old text:

261	      [Unicode3.2] defines several bidirectional categories; each
262	      character has one bidirectional category assigned to it.  For the
263	      purposes of the requirements below, an "RandALCat character" is a
264	      character that has Unicode bidirectional categories "R" or "AL";
265	      an "LCat character" is a character that has Unicode bidirectional
266	      category "L".

268	   New text:

270	      [Unicode3.2] defines several bidirectional categories; each
271	      character has one bidirectional category assigned to it.

273	      For characters that have category "R", "AL" or "L", the category
274	      is fixed (UAX#9 defines them as having "strong" category); for
275	      characters in category EN, ES, ET, AN, CS, NSM, BN, B, S, WS and
276	      ON, the category is determined by applying the algorithm described
277	      in UAX#9 section 3.3 to the string.

279	      For the purposes of the requirements below, an "RandALCat
280	      character" is a character that, after this determination, has
281	      Unicode bidirectional categories "R" or "AL"; an "LCat character"
282	      is a character that has Unicode bidirectional category "L".

284	   Note that Unicode 5.0 is the current version of Unicode.  This fix
285	   refers to Unicode 3.2 only, to maintain consistency with the rest of
286	   RFC 3454.  Nothing here should affect the relationship between
287	   Unicode versions and IDNA.

289	   Also, as noted in the introduction, the Unicode UAX#9 algorithm is
290	   quite complex.  For the purposes of IDNA, a simpler algorithm may be
291	   defined that yields the same result within the constraints of this
292	   context, but may be easier for people to implement consistently.
293	   Such an algorithm may be included in later versions of this memo.
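
   As one illustration of what a reduced check might look like, the
   sketch below applies only rule W1 of UAX#9 (a non-spacing mark takes
   the type of the previous character) before testing the first and
   last characters.  It is a simplification under stated assumptions
   and is not proposed as normative text: the other weak and neutral
   categories are left unresolved, the function names are hypothetical,
   and Unicode 3.2 properties are taken from Python's
   unicodedata.ucd_3_2_0 object.

```python
from unicodedata import ucd_3_2_0 as ucd  # Unicode 3.2 data, as in RFC 3454


def resolve_nsm(label: str, sor: str = "L") -> list[str]:
    """Bidi categories of `label` after applying UAX#9 rule W1 only."""
    resolved = []
    prev = sor  # assumption: a leading NSM takes the start-of-run direction
    for ch in label:
        cat = ucd.bidirectional(ch)
        if cat == "NSM":
            cat = prev
        resolved.append(cat)
        prev = cat
    return resolved


def passes_modified_rule_3(label: str) -> bool:
    """First/last-character test applied to the resolved categories."""
    randal = [cat in ("R", "AL") for cat in resolve_nsm(label)]
    if not any(randal):
        return True
    return randal[0] and randal[-1]


KONPEETARU = "\u0786\u07AE\u0782\u07B0\u0795\u07A9\u0793\u07A6\u0783\u07AA"
YIVO = "\u05D9\u05B4\u05D5\u05D0\u05B8"
```

   With this resolution step both example labels from section 2 pass,
   while a label such as ALEF followed by DIGIT FIVE is still rejected
   because the digit keeps category EN.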

295	4.1.  Alternative approach

297	   The editors are not entirely happy with the text above.  We are
298	   considering, instead, a complete replacement for section 6 of RFC
299	   3454.

301	   A first draft of such a section is below.

303	   Conceptually, to verify suitability as a domain name label, one
304	   constructs the string consisting of the label preceded and followed
305	   by a full stop (U+002E), and executes the Unicode bidirectional
306	   algorithm twice, once with <sor> (start of run) and <eor> (end of
307	   run) having direction L, and once with them having direction R. (The
308	   full stop, being of bidi class CS, is used because it seems likely to
309	   show up any problems, and occurs next to labels a lot of the time.
310	   Other times, a label is adjacent to an @ sign, a space or another
311	   character.)

313	   The following conditions MUST be true in both resulting strings for
314	   the string to be acceptable:

316	   o  The leftmost and rightmost character of the resulting string in
317	      display order must be a full stop (U+002E)

319	   o  No non-spacing mark (NSM) can occur in the second position of the
320	      string (leftmost in L order, rightmost in R order); that is, no
321	      mark can be allowed to attach to the delimiting characters.

323	   o  The direction of the leftmost and rightmost characters in the
324	      string (the periods) must be either L or R

326	   Note that there is no requirement that the character sequence be the
327	   same in the two cases.

329	   All RTL strings permitted by RFC 3454 section 6 will pass this test.
330	   Strings that consist of such a string with NSM characters appended to
331	   it will also pass this test.

333	   [[NOTE: Not sure if the ALEF 5 vs 5 ALEF issue will be solved by this
334	   rule.  Test needed.]]

336	   [[NOTE: do we need to require something for the sor=L, eor=R and
337	   sor=R, eor=L cases?]]

339	5.  Other issues in need of resolution

341	   This is not the only issue with right-to-left scripts.  Retaining
342	   Yiddish for the purposes of further exemplification, its alphabet
343	   includes three digraphs that can be encoded both as consecutive
344	   instances of the two component characters, and as precomposed
345	   ligatures.  One of these digraphs also requires additional combined
346	   marking.  For example, the HEBREW LIGATURE YIDDISH DOUBLE VAV
347	   (U+05F0) is orthographically equivalent to, and typographically
348	   utterly confusable with, a sequence of two HEBREW LETTER VAV
349	   (U+05D5).  However, the ligature has no canonical decomposition and
350	   is therefore preserved by the IDNA algorithm.  These digraphs need to
351	   be enumerated and the one form either made invalid for input in the
352	   IDNA context, or normalized to the other.
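
   The lack of a decomposition can be checked directly.  The sketch
   below uses Python's standard unicodedata module; the helper name
   nfkc_distinct is hypothetical, and NFKC is used here because it is
   the normalization form that Stringprep profiles such as Nameprep
   apply.

```python
import unicodedata

DOUBLE_VAV = "\u05F0"      # HEBREW LIGATURE YIDDISH DOUBLE VAV
TWO_VAVS = "\u05D5\u05D5"  # HEBREW LETTER VAV, twice


def nfkc_distinct(a: str, b: str) -> bool:
    """Return True if `a` and `b` remain different strings after NFKC."""
    return unicodedata.normalize("NFKC", a) != unicodedata.normalize("NFKC", b)


if __name__ == "__main__":
    # An empty string means the character has no decomposition mapping.
    print(repr(unicodedata.decomposition(DOUBLE_VAV)))
    print(nfkc_distinct(DOUBLE_VAV, TWO_VAVS))
```

   Because normalization leaves the ligature untouched, the two
   spellings survive as distinct labels unless an additional mapping or
   prohibition is introduced.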

354	   We believe that there is a clear likelihood of similar issues
355	   existing with other scripts and languages that are not currently used
356	   extensively with IDNs.  Careful consideration of all the languages
357	   written in a given script, in consultation with all of the
358	   corresponding speech communities, is therefore needed before we can
359	   say with any degree of certainty that using that script for IDNs is
360	   unproblematic.

362	   Another set of issues concerns the proper display of IDNs with a
363	   mixture of LTR and RTL labels, or only RTL labels; it is not clear to
364	   these authors what the proper display order of the components of a
365	   domain name is if the direction of the components (in network
366	   order) is, for instance, FirstRTL.SecondRTL.LTR - is it
367	   LTRtsriF.LTRdnoceS.LTR or LTRdnoceS.LTRtsrif.LTR?  Again, this memo
368	   does not attempt to suggest a solution to this problem.

370	6.  Backwards compatibility considerations

372	   As with any change to an existing standard, it is important to
373	   consider what happens with existing implementations when the change
374	   is introduced.  The following troublesome cases have been noted:

376	   o  Old program used to input the newly allowed string.  If the old
377	      program checks the input against RFC 3454, the string will not be
378	      allowed, and that domain name will remain inaccessible.

380	   o  Old program is asked to display the newly allowed string, and
381	      checks it against RFC 3454 before displaying.  The program will
382	      perform some kind of fallback, most likely displaying the Punycode
383	      form of the string.

385	   o  Old program tries to display the newly allowed string.  If the old
386	      program has code for displaying the last character of a string
387	      that is different from the code used to display the characters in
388	      the middle of the string, display may be inconsistent and cause
389	      confusion.

391	   One particular example of the last case is if a program chooses to
392	   examine the last character (in network order) of a string in order to
393	   determine its directionality, rather than its first; if it finds an
394	   NSM character and tries to display the string as if it was a left-to-
395	   right string, the resulting display may be interesting, but not
396	   useful.

398	   The editors believe that these cases will have less harmful impact in
399	   practice than continuing to deny the use of words from the languages
400	   for which these strings are necessary as IDN labels.

402	7.  IANA Considerations

404	   This document makes no request of IANA.

406	   Note to RFC Editor: this section may be removed on publication as an
407	   RFC.

409	8.  Security Considerations

411	   This modification will allow some strings to be used in Stringprep
412	   contexts that are not allowed today.  It is possible that differences
413	   in the interpretation of the specification between old and new
414	   implementations could pose a security risk, but it is difficult to
415	   envision any specific instantiation of this.

417	   Any rational attempt to compute, for instance, a hash over an
418	   identifier processed by stringprep would use network order for its
419	   computation, and thus be unaffected by the changes proposed here.

421	   While it is not believed to pose a problem, if display routines had
422	   been written with specific knowledge of the current Stringprep
423	   prohibitions, the problems noted under "backwards compatibility"
424	   could cause new kinds of confusion.

426	9.  Acknowledgements

428	   While the listed editors held the pen, this document represents the
429	   joint work and conclusions of an ad hoc design team.  In addition to
430	   the editors this consisted of, in alphabetic order, Tina Dam, Patrik
431	   Faltstrom, and John Klensin.  Many further specific contributions and
432	   helpful comments were received from the people listed below, and
433	   others who have contributed to the development and use of the IDNA
434	   protocols.

436	   The team wishes in particular to thank Roozbeh Pournader for calling
437	   its attention to the issue with the Thaana script, and Paul Hoffman
438	   for pointing out the need to be explicit about backwards
439	   compatibility considerations.

441	Appendix A.  Change log

443	   This appendix is intended to be removed when this document is
444	   published as an RFC.

446	A.1.  Changes from -00 to -01

448	   Suggested a possible new algorithm.

450	   Multiple smaller changes.

452	A.2.  Changes from -01 to -02

454	   Date of publication updated.

456	   Change log added.

458	10.  References

460	   [RFC3454]  Hoffman, P. and M. Blanchet, "Preparation of
461	              Internationalized Strings ("stringprep")", RFC 3454,
462	              December 2002.

464	   [UAX9]     "Unicode Standard Annex #9: The Bidirectional
465	              Algorithm, revision 15", March 2005.

467	Authors' Addresses

469	   Harald Tveit Alvestrand (editor)
470	   Google
471	   Beddingen 10
472	   Trondheim,   7014
473	   Norway

475	   Email: harald@alvestrand.no

476	   Cary Karp (editor)
477	   Swedish Museum of Natural History
478	   Frescativ. 40
479	   Stockholm,   10405
480	   Sweden

482	   Phone: +46 8 5195 4055
484	   Email: ck@nrm.museum

487	Full Copyright Statement

489	   Copyright (C) The IETF Trust (2008).

491	   This document is subject to the rights, licenses and restrictions
492	   contained in BCP 78, and except as set forth therein, the authors
493	   retain all their rights.

495	   This document and the information contained herein are provided on an
496	   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
497	   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
498	   THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
499	   OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
500	   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
501	   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

503	Intellectual Property

505	   The IETF takes no position regarding the validity or scope of any
506	   Intellectual Property Rights or other rights that might be claimed to
507	   pertain to the implementation or use of the technology described in
508	   this document or the extent to which any license under such rights
509	   might or might not be available; nor does it represent that it has
510	   made any independent effort to identify any such rights.  Information
511	   on the procedures with respect to rights in RFC documents can be
512	   found in BCP 78 and BCP 79.

514	   Copies of IPR disclosures made to the IETF Secretariat and any
515	   assurances of licenses to be made available, or the result of an
516	   attempt made to obtain a general license or permission for the use of
517	   such proprietary rights by implementers or users of this
518	   specification can be obtained from the IETF on-line IPR repository at
519	   http://www.ietf.org/ipr.

521	   The IETF invites any interested party to bring to its attention any
522	   copyrights, patents or patent applications, or other proprietary
523	   rights that may cover technology that may be required to implement
524	   this standard.  Please address the information to the IETF at
525	   ietf-ipr@ietf.org.

527	Acknowledgment

529	   Funding for the RFC Editor function is provided by the IETF
530	   Administrative Support Activity (IASA).