idnits 2.17.1 

draft-duerst-iri-bidi-00.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Looks like you're using RFC 2026 boilerplate.  This must be updated to
     follow RFC 3978/3979, as updated by RFC 4748.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  ** The document seems to lack a 1id_guidelines paragraph about 6 months
     document validity -- however, there's a paragraph with a matching
     beginning. Boilerplate error?

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard

  == The page length should not exceed 58 lines per page, but there was 1
     longer page, the longest (page 1) being 382 lines


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an IANA Considerations section.  (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  ** The document seems to lack separate sections for Informative/Normative
     References.  All references will be assumed normative when checking for
     downward references.

  ** There are 6 instances of too long lines in the document, the longest one
     being 2 characters in excess of 72.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The document seems to lack the recommended RFC 2119 boilerplate, even if
     it appears to use RFC 2119 keywords. 

     (The document does seem to have the reference to RFC 2119 which the
     ID-Checklist requires).
  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (July 13, 2001) is 8322 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Unused Reference: 'Nameprep' is defined on line 342, but no explicit
     reference was found in the text

  -- Possible downref: Non-RFC (?) normative reference: ref. 'HTML4'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'IDN'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'IRI'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'ISO10646'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'Nameprep'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'Unicode'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'UnicodeBidi'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'W3C IRI'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'XML'


     Summary: 5 errors (**), 0 flaws (~~), 4 warnings (==), 11 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

1	INTERNET-DRAFT                                              Martin Duerst
2	                                                       W3C/Keio University
3	draft-duerst-iri-bidi-00.txt
4	Expires January 2002                                        July 13, 2001

6	            Internet Identifiers and Bidirectionality

8	Status of this Memo

10	This document is an Internet-Draft and is in full conformance with all
11	provisions of Section 10 of RFC2026.

13	Internet-Drafts are working documents of the Internet Engineering Task
14	Force (IETF), its areas, and its working groups.  Note that other
15	groups may also distribute working documents as Internet-Drafts.

17	Internet-Drafts are draft documents valid for a maximum of six months
18	and may be updated, replaced, or obsoleted by other documents at any
19	time.  It is inappropriate to use Internet-Drafts as reference
20	material or to cite them other than as "work in progress."

22	The list of current Internet-Drafts can be accessed at
23	http://www.ietf.org/ietf/1id-abstracts.txt.

25	The list of Internet-Draft Shadow Directories can be accessed at
26	http://www.ietf.org/shadow.html.

28	This document is not a product of any working group, but should be
29	discussed on the mailing list <uri@w3.org>. Comments of editorial
30	nature should be sent directly to the author. For more information
31	on the topic of this Internet-Draft, please also see [W3C IRI].

33	Abstract

35	This memo describes how to deal with Internet identifiers containing
36	characters from scripts such as Arabic and Hebrew, which use right-to-
37	left or bidirectional writing. The solution proposed addresses three
38	different contexts: The purely graphical representation of such
39	identifiers, e.g. on paper, the embedding of such identifiers into
40	running text with established rules for bidirectionality, and the
41	processing and resolution of such identifiers.

43	0. Change history

45	Version 00:

47	This memo has been separated out from [IRI], Section 3.2 to allow
48	more in-depth and focused discussion of the specific problems of
49	bidirectionality.

51	1. Introduction

53	There is an increased tendency to allow identifiers to use a wide
54	range of characters from the scripts of the world. The Universal
55	Character Set (UCS, see [Unicode] and [ISO10646]) makes it easy to
56	use and exchange such identifiers digitally. With the appropriate
57	care (similar to the care needed to avoid confusion between '1', 'l',
58	and 'I' in US-ASCII-based identifiers), such identifiers can also
59	be exchanged non-digitally, e.g. written down visually on a medium
60	such as paper. Potential examples of such identifiers include
61	Internationalized Resource Identifiers [IRI], Internationalized
62	Domain Names [IDN], and internationalized email addresses.

64	Some characters, in particular those of the Arabic and the Hebrew
65	script, are written from right to left. Together with characters
66	written from left to right, or with digits that are written from
67	left to right even in these scripts, this gives rise to the
68	mixture of different writing directions, a phenomenon called
69	bidirectionality. Dealing with bidirectionality is indispensable
70	for the proper treatment of text written with the Arabic or Hebrew
71	script. But it is highly complex because user expectations may
72	depend on context and are often difficult to identify and express.

74	This memo deals with the specific problems of Internet identifiers
75	containing right-to-left characters, hereafter called bidirectional
76	identifiers.

78	The basic paradigm of all modern bidirectional text handling solutions
79	is the distinction between digital backing store, where text is stored
80	in logical order, and rendering (display or printing), for which the
81	necessary reordering is applied according to well-defined rules.
82	'Logical order' in this context is the order in which the characters
83	in the text are pronounced or spelled out.

85	Using logical order in the digital backing store simplifies a large
86	number of operations, including sorting, searching, text-to-speech
87	conversion, various other kinds of linguistic processing, input from
88	keyboards and other devices, and rendering-related operations such as
89	line breaking and text reflow. The alternative is to use display order
90	even in the backing store, but this makes some of the operations above
91	much more complex and others impossible.

93	For general text (e.g. average prose), the Unicode bidirectional
94	algorithm [UnicodeBidi] is the single widely accepted and used reference
95	for providing this reordering from logical order to rendering. The
96	Unicode bidirectional algorithm consists of an implicit part (producing
97	adequate results in most cases) and explicit formatting characters for
98	advanced cases.

100	The Unicode bidirectional algorithm also allows higher-order protocols
101	to override certain aspects of the algorithm. A case where this has
102	been done is the 'dir' attribute in [HTML4].

104	Bidirectional Internet identifiers primarily are used in three different
105	contexts:

107	1) In visual form: This includes display on CRTs and LCDs as well as
108	    more permanent visual forms such as printing. At least as far
109	    as the reading of individual components is concerned, the visual
110	    form has to use the inherent directionality of the characters used.
111	    Otherwise, identification, reading, transcription, and so on are
112	    severely affected.

114	2) In digital form inside running text (e.g. an IRI or an email
115	    address in an email or on a web page). It is not always easy
116	    or possible to distinguish identifiers from other text.

118	3) In digital form on its own (e.g. in a structured format or
119	    database of identifiers, or when transmitted for resolution).
120	    It should be possible to process bidirectional identifiers
121	    in the same way as other Internet identifiers.

123	This memo addresses all three of these cases as well as the conversion
124	between them. The specifics of bidirectional text and of identifier
125	structure make it impossible to design a solution that works without
126	additional effort (when compared to non-bidirectional identifiers).
127	However, the solution proposed in this memo is designed to make the
128	best out of the severe constraints.

130	2. Notational Conventions

132	Keywords in all upper-case such as MUST and SHOULD are defined
133	in [RFC 2119]. For examples, lower-case letters are used for
134	letters that flow left to right. Upper-case letters stand for
135	letters that flow from right to left. A left-to-right example
136	would be 'hello', whereas a right-to-left example would be
137	'OLLEH'.

139	For bidirectional formatting characters from [Unicode], the [XML]-style
140	entity notation is used, as follows:

142	&lrm;    U+200E     LEFT-TO-RIGHT MARK
143	&rlm;    U+200F     RIGHT-TO-LEFT MARK
144	&lre;    U+202A     LEFT-TO-RIGHT EMBEDDING
145	&rle;    U+202B     RIGHT-TO-LEFT EMBEDDING
146	&pdf;    U+202C     POP DIRECTIONAL FORMATTING
147	&lro;    U+202D     LEFT-TO-RIGHT OVERRIDE
148	&rlo;    U+202E     RIGHT-TO-LEFT OVERRIDE

150	Only the first two are defined in [HTML4]; the others are
151	replaced by the 'dir' attribute and the <bdo> element.
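
As an illustration only, the notation above can be expanded into the
actual characters with a small lookup table. The following Python
sketch is not part of the memo; the names BIDI_ENTITIES and
expand_entities are hypothetical and chosen for this example.

```python
import re

# Entity names as used in this memo, mapped to the Unicode characters
# listed in the table above.
BIDI_ENTITIES = {
    "lrm": "\u200E",  # LEFT-TO-RIGHT MARK
    "rlm": "\u200F",  # RIGHT-TO-LEFT MARK
    "lre": "\u202A",  # LEFT-TO-RIGHT EMBEDDING
    "rle": "\u202B",  # RIGHT-TO-LEFT EMBEDDING
    "pdf": "\u202C",  # POP DIRECTIONAL FORMATTING
    "lro": "\u202D",  # LEFT-TO-RIGHT OVERRIDE
    "rlo": "\u202E",  # RIGHT-TO-LEFT OVERRIDE
}

_ENTITY_RE = re.compile(r"&(" + "|".join(BIDI_ENTITIES) + r");")


def expand_entities(text):
    """Replace the memo's entity notation with the actual characters."""
    return _ENTITY_RE.sub(lambda m: BIDI_ENTITIES[m.group(1)], text)
```

For instance, expand_entities("&lrm;FTP&lrm;.com") yields the string
"FTP.com" with a LEFT-TO-RIGHT MARK before and after "FTP".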

153	3. Identifier Structure

155	Most Internet identifiers have an inherent structure that distinguishes
156	structural characters (usually punctuation such as '@', '.', ':', '/',
157	and so on) and payload components (usually formed with plain alphabetic
158	or alphanumeric characters).

160	In order to be able to process bidirectional identifiers in the same
161	way as other identifiers, it is crucial that in the digital
162	representations, the individual structural characters and identifier
163	components are stored in the same sequence as for other identifiers.

165	The main problem to solve for the visual representation of bidirectional
166	identifiers is whether the general sequence of components and syntax
167	characters should be from left to right or from right to left, i.e.
168	whether the right-to-left equivalent of "ftp.example.com" should be
169	"MOC.ELPMAXE.PTF" or "PTF.ELPMAXE.MOC". The former one may be
170	seen as more natural in a purely right-to-left context. But there is also
171	the possibility of mixed identifiers such as "PTF.ELPMAXE.com".
172	These provide a very strong motivation for maintaining the same
173	left-to-right overall component sequence for all Internet identifiers.

175	The Unicode bidirectional algorithm, extremely simplified, tries to
176	reorder continuous sequences of right-to-left characters between
177	continuous sequences of left-to-right characters. A third category,
178	called neutrals, is processed in the same way as surrounding characters.
179	The main problem for identifiers is that all the structural characters
180	are treated as neutrals by the Unicode algorithm, which means that they
181	are moved around together with their context. As an example, the
182	logical sequence FTP.EXAMPLE.com (corresponding to the example above),
183	without additional care is displayed as ELPMAXE.PTF.com, which is
184	obviously highly confusing.
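
The effect described above can be reproduced with a deliberately crude
model. The Python sketch below is not the Unicode bidirectional
algorithm; it only mimics the "extremely simplified" description given
here, assuming a left-to-right base direction and the example
convention of this memo (upper-case letters stand for right-to-left
characters, lower-case letters for left-to-right characters, and
everything else is treated as neutral). The function name
toy_bidi_display is hypothetical.

```python
def toy_bidi_display(logical):
    """Toy model of implicit reordering with a left-to-right base.

    A run starts at a right-to-left (upper-case) character and extends
    over neutrals as long as another right-to-left character follows
    before the next left-to-right (lower-case) character. Each such run
    is reversed as a whole, neutrals included.
    """
    out = []
    i = 0
    n = len(logical)
    while i < n:
        if logical[i].isupper():
            j = i
            last_rtl = i
            # Scan forward until a left-to-right character is found,
            # remembering the last right-to-left character seen.
            while j < n and not logical[j].islower():
                if logical[j].isupper():
                    last_rtl = j
                j += 1
            out.append(logical[i:last_rtl + 1][::-1])
            i = last_rtl + 1
        else:
            out.append(logical[i])
            i += 1
    return "".join(out)


print(toy_bidi_display("FTP.EXAMPLE.com"))  # ELPMAXE.PTF.com
```

Because the '.' between FTP and EXAMPLE sits between two right-to-left
components, the toy model moves it together with them, which is the
confusing result described in the text.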

186	4. Bidirectional Identifiers in Context

188	4.1 Independently Processed Bidirectional Identifiers

190	Bidirectional identifiers processed independently, i.e. stored or
191	transmitted for resolution, MUST be in full logical order both
192	for the overall structure as well as for the individual components.
193	They MUST conform directly to the relevant syntax rules.

195	4.2 Visual Rendering of Bidirectional Identifiers

197	The sequence of components and structural characters of a
198	bidirectional identifier MUST be rendered visually from left to right.
199	Each component MUST be rendered according to its natural direction
200	(i.e. left-to-right for components with left-to-right characters,
201	right-to-left for components with right-to-left characters).
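
A minimal sketch of this rendering rule, again using the example
convention of this memo (upper-case letters stand for right-to-left
characters), might look as follows in Python. The set of structural
characters and the function name render_visual are assumptions made
for this illustration; real identifier syntaxes define their own
structural characters.

```python
import re

# Assumption: a small sample of structural characters, taken from the
# examples in Section 3.
_STRUCTURAL_RE = re.compile(r"([@.:/])")


def render_visual(logical):
    """Return the left-to-right sequence of glyphs for an identifier.

    Components and structural characters keep their logical sequence;
    only the characters inside a right-to-left component are reversed.
    """
    rendered = []
    for part in _STRUCTURAL_RE.split(logical):
        if part.isupper():
            # Right-to-left component: its first character ends up
            # rightmost within the component.
            rendered.append(part[::-1])
        else:
            rendered.append(part)
    return "".join(rendered)


print(render_visual("FTP.EXAMPLE.com"))  # PTF.ELPMAXE.com
```

The output matches the mixed example "PTF.ELPMAXE.com" from Section 3.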

203	4.3 Bidirectional Identifiers in Textual Context

205	In textual context, i.e. assuming rendering by the Unicode bidirectional
206	algorithm, the backing store representation prescribed in Section 4.1
207	and the visual rendering prescribed in Section 4.2 have to be
208	combined. This is done as follows:

210	- Each component with right-to-left characters is preceded and
211	   followed by an &lrm;. This left-to-right mark provides a
212	   left-to-right context to intervening syntactic characters.

214	- If the overall context (base directionality) is right-to-left,
215	   the identifier is preceded by an &lre; and followed by a &pdf;.
216	   This makes sure that the components of the identifier are
217	   rendered in left-to-right order. This may also be done by
218	   using the equivalent features of a higher-order protocol
219	   (e.g. by using the dir='ltr' attribute in HTML).
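
The two steps above can be sketched in Python as follows. This is an
illustration under stated assumptions, not a reference implementation:
upper-case letters stand for right-to-left characters as in the
examples of this memo, the set of structural characters is a small
sample, and the names to_textual and rtl_context are hypothetical.

```python
import re

LRM = "\u200E"  # LEFT-TO-RIGHT MARK
LRE = "\u202A"  # LEFT-TO-RIGHT EMBEDDING
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING

# Assumption: sample structural characters from Section 3.
_STRUCTURAL_RE = re.compile(r"([@.:/])")


def to_textual(logical, rtl_context=False):
    """Add formatting characters for embedding in running text."""
    out = []
    for part in _STRUCTURAL_RE.split(logical):
        if part.isupper():
            # Right-to-left component: surround it with LRMs so that
            # neighbouring structural characters see a left-to-right
            # context.
            out.append(LRM + part + LRM)
        else:
            out.append(part)
    text = "".join(out)
    if rtl_context:
        # Right-to-left base direction: embed the whole identifier.
        text = LRE + text + PDF
    return text
```

For the logical sequence FTP.EXAMPLE.com, this yields
&lrm;FTP&lrm;.&lrm;EXAMPLE&lrm;.com in the notation of Section 2.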

221	4.4 Conversions

223	Conversion from textual context to visual representation is done
224	simply by applying the Unicode bidirectional algorithm, i.e. by
225	passing the whole text to an appropriate rendering engine.

227	Conversion from processing representation to textual context is
228	done by adding the necessary formatting characters as described
229	in Section 4.3.

231	Conversion from textual context to processing representation is
232	done by removing the formatting characters at the positions
233	described in Section 4.3. For international domain names, this
234	can, for example, be integrated into [Nameprep].

236	Conversion from visual representation to processing representation
237	is done by inputting the identifier, component-by-component from
238	left to right, using the natural reading order for each component.

240	From these conversions, the remaining conversions can be
241	easily constructed. Any other procedure that leads to the same results
242	is also allowed.
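
As a rough sketch of the conversion from textual context to processing
representation, the Python fragment below simply strips the three
formatting characters used in Section 4.3. This is a simplification:
it removes these characters wherever they occur rather than only at
the prescribed positions, which gives the same result for identifiers
that respect the restrictions in Section 5. The name to_processing is
hypothetical.

```python
LRM = "\u200E"  # LEFT-TO-RIGHT MARK
LRE = "\u202A"  # LEFT-TO-RIGHT EMBEDDING
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING


def to_processing(textual):
    """Strip the formatting characters added for textual context."""
    for mark in (LRM, LRE, PDF):
        textual = textual.replace(mark, "")
    return textual
```

Applied to &lre;&lrm;FTP&lrm;.com&pdf; (in the notation of Section 2),
this returns the logical sequence FTP.com.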

244	5. Restrictions

246	The definitions and conversions in Section 4 only work under the
247	following restrictions.

249	1) A component MUST NOT use both right-to-left and left-to-right
250	    characters.

252	2) A component MUST NOT contain bidirectional formatting characters
253	    except for those and in those positions as defined in Section 4.3.

255	3) A component using right-to-left characters MUST NOT use any other
256	    class of characters (e.g. neutrals or numbers).

258	Restrictions 1) and 2) are not very severe, in that they do not overly
259	restrict useful identifiers. Also, trying to remove them would make it
260	impossible for humans to predict the logical sequence of characters
261	inside a single component. On the other hand, it would be very desirable
262	to remove or at least soften restriction 3). Otherwise, it is impossible
263	to combine Arabic or Hebrew letters with numbers, or to use a hyphen
264	between two subcomponents of an Arabic component to avoid the cursive
265	connection of the two subcomponents. To a certain extent, softening this
266	restriction should be easily possible by adding additional formatting
267	characters in well defined ways similar to the provisions in Section 4.3.
268	Feedback on this issue is particularly welcome.
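
One possible way to check a single component against the three
restrictions is sketched below in Python, using the bidirectional
character classes exposed by the standard unicodedata module. The
reading of the restrictions is a literal one and is an assumption of
this sketch: classes R and AL count as right-to-left, class L as
left-to-right, and the component is taken in its processing
representation, so that any of the formatting characters from Section 2
inside it counts as a violation. The function name
violated_restrictions is hypothetical.

```python
import unicodedata

RTL_CLASSES = {"R", "AL"}
BIDI_FORMATTING = {
    "\u200E", "\u200F", "\u202A", "\u202B", "\u202C", "\u202D", "\u202E",
}


def violated_restrictions(component):
    """Return the numbers (1, 2, 3) of the restrictions violated."""
    # Formatting characters are judged by restriction 2 only, so they
    # are left out when looking at the directional classes.
    plain = [ch for ch in component if ch not in BIDI_FORMATTING]
    classes = [unicodedata.bidirectional(ch) for ch in plain]
    has_rtl = any(c in RTL_CLASSES for c in classes)
    has_ltr = "L" in classes

    violated = []
    if has_rtl and has_ltr:
        violated.append(1)
    if len(plain) != len(component):
        violated.append(2)
    if has_rtl and any(c not in RTL_CLASSES and c != "L" for c in classes):
        violated.append(3)
    return violated
```

Under this reading, a Hebrew component followed by a digit, or two
Hebrew letters joined by a hyphen, is reported as violating
restriction 3, which illustrates why softening that restriction would
be desirable.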

270	6. Security Considerations

272	Knowledge of deficiencies of a particular implementation of the above
273	specification can allow somebody to pretend to resolve a particular
274	identifier when in fact another identifier is being resolved.

276	Acknowledgements

278	The basic idea for the approach proposed in this memo is due to
279	Francois Yergeau, and goes back to around 1995. Discussions with
280	Stephen Atkin, Paul Hoffman, and many others provided additional
281	motivation and insight.

283	Copyright

285	Copyright (C) The Internet Society, 1997. All Rights Reserved.

287	This document and translations of it may be copied and furnished to
288	others, and derivative works that comment on or otherwise explain it
289	or assist in its implementation may be prepared, copied, published
290	and distributed, in whole or in part, without restriction of any
291	kind, provided that the above copyright notice and this paragraph
292	are included on all such copies and derivative works.  However, this
293	document itself may not be modified in any way, such as by removing
294	the copyright notice or references to the Internet Society or other
295	Internet organizations, except as needed for the purpose of
296	developing Internet standards in which case the procedures for
297	copyrights defined in the Internet Standards process must be
298	followed, or as required to translate it into languages other
299	than English.

301	The limited permissions granted above are perpetual and will not be
302	revoked by the Internet Society or its successors or assigns.

304	This document and the information contained herein is provided on an
305	"AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
306	TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
307	BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
308	HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
309	MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

311	Author's address

313	           Martin J. Duerst
314	           W3C/Keio University
315	           5322 Endo, Fujisawa
316	           252-8520 Japan
317	           duerst@w3.org
318	           http://www.w3.org/People/D%C3%BCrst/
319	           Tel/Fax: +81 466 49 1170

321	           Note: Please write "Duerst" with u-umlaut wherever
322	                 possible, e.g. as "D&#252;rst" in XML and HTML.

324	References

326	[HTML4] "HTML 4.01", World Wide Web Consortium,
327	   <http://www.w3.org/TR/REC-html40>.

329	[IDN] Internationalized Domain Name (idn) IETF Working Group. For
330	   further information, please see
331	   <http://www.ietf.org/html.charters/idn-charter.html>.

333	[IRI] L. Masinter, M. Duerst, "Internationalized Resource Identifiers
334	   (IRI)", Internet Draft, Jan. 2001,
335	   <http://www.ietf.org/internet-drafts/draft-masinter-url-i18n-07.txt>,
336	   work in progress.

338	[ISO10646] ISO/IEC, Information Technology - Universal Multiple-Octet
339	   Coded Character Set (UCS) - Part 1: Architecture and Basic
340	   Multilingual Plane, Oct. 2000, with amendments.

342	[Nameprep] P. Hoffman, M. Blanchet, "Preparation of Internationalized
343	   Host Names", Internet Draft, Feb. 2001,
344	   <http://www.ietf.org/internet-drafts/draft-ietf-idn-nameprep-03.txt>,
345	   work in progress.

347	[RFC 2119] S. Bradner, "Key words for use in RFCs to Indicate
348	   Requirement Levels", March 1997.

350	[Unicode] The Unicode Consortium, "The Unicode Standard, Version 3.1",
351	   consisting of: "The Unicode Standard, Version 3.0", Addison-Wesley,
352	   Reading, MA, 2000, and "Unicode Standard Annex #27: Unicode 3.1",
353	   <http://www.unicode.org/unicode/reports/tr27/>, May 2001.

355	[UnicodeBidi] The Unicode Consortium, "The Unicode Standard, Version
356	   3.0", Addison-Wesley, Reading, MA, 2000, Section 3.12, pp. 55-69, also
357	   available at <http://www.unicode.org/unicode/uni2book/ch03.pdf>
358	   and "Unicode Standard Annex #9: The Bidirectional Algorithm",
359	   <http://www.unicode.org/unicode/reports/tr9/tr9-9.html>, March 2001.

361	[W3C IRI] Internationalization - URIs and other identifiers
362	   <http://www.w3.org/International/O-URL-and-ident.html>.

364	[XML] "XML 1.0", World Wide Web Consortium Recommendation,
365	   <http://www.w3.org/TR/REC-xml#sec-external-ent>.