idnits 2.17.1 

draft-klensin-unicode-escapes-07.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** It looks like you're using RFC 3978 boilerplate.  You should update this
     to the boilerplate described in the IETF Trust License Policy document
     (see https://trustee.ietf.org/license-info), which is required now.

  -- Found old boilerplate from RFC 3978, Section 5.1 on line 16.

  -- Found old boilerplate from RFC 3978, Section 5.5, updated by RFC 4748 on
     line 613.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 624.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 631.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 637.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust Copyright Line does not match the
     current year

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (November 17, 2007) is 5997 days in the past.  Is this
     intentional?


  Checking references for intended status: Best Current Practice
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Possible downref: Non-RFC (?) normative reference: ref. 'ISO10646'

  ** Obsolete normative reference: RFC 4234 (Obsoleted by RFC 5234)

  -- Possible downref: Non-RFC (?) normative reference: ref. 'Unicode'


     Summary: 2 errors (**), 0 flaws (~~), 1 warning (==), 9 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Network Working Group                                         J. Klensin
3	Internet-Draft                                         November 17, 2007
4	Intended status: Best Current
5	Practice
6	Expires: May 20, 2008

8	                  ASCII Escaping of Unicode Characters
9	                 draft-klensin-unicode-escapes-07.txt

11	Status of this Memo

13	   By submitting this Internet-Draft, each author represents that any
14	   applicable patent or other IPR claims of which he or she is aware
15	   have been or will be disclosed, and any of which he or she becomes
16	   aware will be disclosed, in accordance with Section 6 of BCP 79.

18	   Internet-Drafts are working documents of the Internet Engineering
19	   Task Force (IETF), its areas, and its working groups.  Note that
20	   other groups may also distribute working documents as Internet-
21	   Drafts.

23	   Internet-Drafts are draft documents valid for a maximum of six months
24	   and may be updated, replaced, or obsoleted by other documents at any
25	   time.  It is inappropriate to use Internet-Drafts as reference
26	   material or to cite them other than as "work in progress."

28	   The list of current Internet-Drafts can be accessed at
29	   http://www.ietf.org/ietf/1id-abstracts.txt.

31	   The list of Internet-Draft Shadow Directories can be accessed at
32	   http://www.ietf.org/shadow.html.

34	   This Internet-Draft will expire on May 20, 2008.

36	Copyright Notice

38	   Copyright (C) The IETF Trust (2007).

40	Abstract

42	   There are a number of circumstances in which an escape mechanism is
43	   needed in conjunction with a protocol to encode characters that
44	   cannot be represented or transmitted directly.  With ASCII coding the
45	   traditional escape has been either the decimal or hexadecimal numeric
46	   value of the character, written in a variety of different ways.  The
47	   move to Unicode, where characters occupy two or more octets and may
48	   be coded in several different forms, has further complicated the
49	   question of escapes.  This document describes some options now in use
50	   and discusses considerations for selecting one for use in new IETF
51	   protocols and protocols that are now being internationalized.

53	Table of Contents

55	   1.  Introduction
56	     1.1.  Context and Background
57	     1.2.  Terminology
58	     1.3.  Discussion List
59	   2.  Encodings that Represent Unicode Code Points: Code
60	       Position versus UTF-8 or UTF-16 Octets
61	   3.  Referring to Unicode Characters
62	   4.  Syntax for Code Point Escapes
63	   5.  Recommended Presentation Variants for Unicode Code Point
64	       Escapes
65	     5.1.  Backslash-U with Delimiters
66	     5.2.  XML and HTML
67	   6.  Forms that are Normally Not Recommended
68	     6.1.  The C Programming Language: Backslash-U
69	     6.2.  Perl: A Hexadecimal String
70	     6.3.  Java: Escaped UTF-16
71	   7.  IANA Considerations
72	   8.  Security Considerations
73	   9.  Acknowledgments
74	   10. Change log
75	     10.1. Changes in -01
76	     10.2. Major Changes in -02
77	     10.3. Major Changes in -03
78	     10.4. Major Changes in -04
79	     10.5. Changes in -05
80	     10.6. Changes in -06
81	     10.7. Changes in -07
82	   Appendix A.   Formal Syntax for Forms Not Recommended
83	   Appendix A.1. The C Programming Language Form
84	   Appendix A.2. Perl Form
85	   Appendix A.3. Java Form
86	   11. References
87	     11.1. Normative References
88	     11.2. Informative References
89	   Author's Address
90	   Intellectual Property and Copyright Statements

92	1.  Introduction

94	1.1.  Context and Background

96	   There are a number of circumstances in which an escape mechanism is
97	   needed in conjunction with a protocol to encode characters that
98	   cannot be represented or transmitted directly.  With ASCII [ASCII]
99	   coding the traditional escape has been either the decimal or
100	   hexadecimal numeric value of the character, written in a variety of
101	   different ways.  For example, in different contexts, we have seen
102	   %dNN or %NN for the decimal form, and %NN, %xNN, X'nn', and %X'NN' for
103	   the hexadecimal form. "%NN" has become popular in recent years to
104	   represent a hexadecimal value without further qualification, perhaps
105	   as a consequence of its use in URLs and their prevalence.  There are
106	   even some applications around in which octal forms are used and,
107	   while they do not generalize well, the MIME Quoted-Printable and
108	   Encoded-word forms can be thought of as yet another set of escapes.
109	   So, even for the fairly simple cases of ASCII and standards built by
110	   extending ASCII, such as the ISO 8859 family, we have been living
111	   with several different escaping forms, each the result of some
112	   history.

114	   When one moves to Unicode [Unicode] [ISO10646], where characters
115	   occupy two or more octets and may be coded in several different
116	   forms, the question of escapes becomes even more complicated.
117	   Unicode represents characters as code points: numeric values from 0
118	   to hex 10FFFF.  When referencing code points in flowing text, they
119	   are represented using the so-called "U+" notation, as values from
120	   U+0000 to U+10FFFF.  When serialized into octets, these code points
121	   can be represented in different forms:

123	   o  in UTF-8 with one to four octets [RFC3629]

125	   o  in UTF-16 with two or four octets (or one or two seizets - 16-bit
126	      units)

128	   o  in UTF-32 with exactly four octets (or one 32-bit unit)
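
   As an informal illustration (not part of this specification), the
   following Python sketch counts the octets that each of the three
   forms above needs for a few sample code points.  The function name
   and the sample values are assumptions chosen for this example.

```python
def octet_counts(code_point):
    """Return the number of octets one code point needs in each form."""
    ch = chr(code_point)
    return {
        "UTF-8": len(ch.encode("utf-8")),
        # The "-be" codecs are used so that no byte order mark is added.
        "UTF-16": len(ch.encode("utf-16-be")),
        "UTF-32": len(ch.encode("utf-32-be")),
    }

# One sample each for the one-, two-, three-, and four-octet UTF-8 cases.
for cp in (0x0041, 0x00E9, 0x20AC, 0x1F600):
    print("U+%04X" % cp, octet_counts(cp))
```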

130	   When escaping characters, we have seen fairly extensive use of
131	   hexadecimal representations of both the serialized forms and
132	   variations on the U+ notation, known as code point escapes.

134	   In accordance with existing best-practices recommendations [RFC2277],
135	   new protocols that are required to carry textual content for human
136	   use SHOULD be designed in such a way that the full repertoire of
137	   Unicode characters may be represented in that text.

139	   This document proposes that existing protocols being
140	   internationalized, and that need an escape mechanism, SHOULD use some
141	   contextually-appropriate variation on references to code points as
142	   described in Section 2 unless other considerations outweigh those
143	   described here.

145	   This recommendation is not applicable to protocols that already
146	   accept native UTF-8 or some other encoding of Unicode.  In general,
147	   when protocols are internationalized, it is preferable to accept
148	   those forms rather than using escapes.  This recommendation applies
149	   to cases, including transition arrangements, in which that is not
150	   practical.

152	   In addition to the protocol contexts addressed in this specification,
153	   escapes to represent Unicode characters also appear in presentations
154	   to users, i.e., in user interfaces (UI).  The formats specified in,
155	   and the reasoning of, this document may be applicable in UI contexts
156	   as well, but this is not a proposal to standardize UI or presentation
157	   forms.

159	   This document does not make general recommendations for processing
160	   Unicode strings or for their contents.  It assumes that the strings
161	   that one might want to escape are valid and reasonable and that the
162	   definition of "valid and reasonable" is the province of other
163	   documents.  Recommendations about general treatment of Unicode
164	   strings may be found in many places, including the Unicode Standard
165	   itself and the W3C Character Model [W3C-CharMod] as well as specific
166	   rules in individual protocols.

168	1.2.  Terminology

170	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
171	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
172	   document are to be interpreted as described in [RFC2119].

174	   Additional Unicode-specific terminology appears in [UnicodeGlossary],
175	   but is not necessary for understanding this specification.

177	1.3.  Discussion List

179	   Discussion of this document should be addressed to the
180	   discuss@apps.ietf.org mailing list.

182	2.  Encodings that Represent Unicode Code Points: Code Position versus
183	    UTF-8 or UTF-16 Octets

185	   There are two major families of ways to escape Unicode characters.
186	   One uses the code point in some representation (see the next
187	   section), the other encodes the octets of the UTF-8 encoding or some
188	   other encoding in some representation.  Some other options are
189	   possible, but they have been rare in practice.  This specification
190	   recommends that, in the absence of compelling reasons to do
191	   otherwise, the Unicode code points SHOULD be used rather than a
192	   representation of UTF-8 (or UTF-16) octets.  There are several
193	   reasons for this, including:

195	   o  One reason for the success of many IETF protocols is that they use
196	      human-interpretable text forms to communicate, rather than
197	      encodings that generally require computer programs (or hand
198	      simulation of algorithms) to decode.  This suggests that the
199	      presentation form should reference the Unicode tables for
200	      characters and do so as simply as possible.

202	   o  Because of the nature of UTF-8, for a human to interpret a decimal
203	      or hexadecimal numeral representation of UTF-8 octets requires one
204	      or more decoding steps to determine a Unicode code point that can
205	      be used to look up the character in a table.  That may be
206	      appropriate in some cases where the goal is really to represent
207	      the UTF-8 form but, in general, it just obscures desired
208	      information and makes errors more likely and debugging harder.

210	   o  Except for characters in the ASCII subset of Unicode (U+0000
211	      through U+007F), the code point form is generally more compact
212	      than forms based on coding UTF-8 octets, sometimes much more
213	      compact.
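
   The two families can be compared with a short, non-normative Python
   sketch.  It assumes a "%NN" hexadecimal escape for each UTF-8 octet
   and, for the code point family, the delimited form described in
   Section 5.1.  Both helper names are hypothetical.

```python
def escape_utf8_octets(ch):
    """Escape each UTF-8 octet of ch as %NN (hexadecimal)."""
    return "".join("%%%02X" % octet for octet in ch.encode("utf-8"))

def escape_code_point(ch):
    """Escape ch as a delimited code point reference (Section 5.1)."""
    return "\\u'%04X'" % ord(ch)

euro = chr(0x20AC)  # EURO SIGN, U+20AC
print(escape_utf8_octets(euro))  # %E2%82%AC
print(escape_code_point(euro))   # \u'20AC'
```

   Reading the first result requires decoding three UTF-8 octets to
   recover the value 20AC; the second can be looked up in the Unicode
   tables directly.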

215	   The same considerations that apply to representation of the octets of
216	   UTF-8 encoding also apply to more compact ACE encodings such as the
217	   "bootstring" encoding [RFC3492] with or without its "Punycode"
218	   profile.

220	   Similar considerations apply to UTF-16 encoding, such as the \uNNNN
221	   form used in Java (See Section 6.3).  While those forms are
222	   equivalent to code point references for the Basic Multilingual Plane
223	   (BMP, Plane 0), a two-stage decoding process is needed to handle
224	   surrogates to access higher planes.
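
   The second decoding stage can be sketched as follows.  This is a
   minimal, non-normative Python illustration of the standard UTF-16
   surrogate arithmetic; the function name is hypothetical.

```python
def combine_surrogates(high, low):
    """Combine a UTF-16 surrogate pair into one Unicode code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    # Each surrogate carries ten bits of the offset above U+FFFF.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# The pair D83D, DE00 designates U+1F600.
print("U+%04X" % combine_surrogates(0xD83D, 0xDE00))
```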

226	3.  Referring to Unicode Characters

228	   Regardless of what decisions are made about escapes for Unicode
229	   characters in protocol or similar contexts, text referring to a
230	   Unicode code point SHOULD use the U+NNNN[N[N]] syntax, as specified
231	   in the Unicode Standard, where the NNNN... string consists of
232	   hexadecimal digits.  Text actually containing a Unicode character
233	   SHOULD use a syntax more suitable for automated processing.
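
   A minimal Python sketch of producing and parsing the U+ notation
   follows.  It is illustrative only: the helper names are
   hypothetical, and the parser assumes upper-case hexadecimal digits
   and rejects values above U+10FFFF.

```python
import re

# Four to six upper-case hexadecimal digits after "U+".
_U_PLUS = re.compile(r"U\+([0-9A-F]{4,6})")

def to_u_plus(ch):
    """Return the U+ notation for a single character."""
    return "U+%04X" % ord(ch)

def from_u_plus(text):
    """Return the character named by a U+NNNN[N[N]] reference."""
    match = _U_PLUS.fullmatch(text)
    if match is None:
        raise ValueError("not a U+ reference: %r" % text)
    code_point = int(match.group(1), 16)
    if code_point > 0x10FFFF:
        raise ValueError("code point out of range: %r" % text)
    return chr(code_point)
```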

235	4.  Syntax for Code Point Escapes

237	   There are many options for code point escapes, some of which are
238	   summarized below.  All are equivalent in content and semantics -- the
239	   differences lie in syntax.  The best choice of syntax for a
240	   particular protocol or other application depends on that application:
241	   one form may simply "fit" better in a given context than others.  It
242	   is clear, however, that hexadecimal values are preferable to other
243	   alternatives: Systems based on decimal or octal offsets SHOULD NOT be
244	   used.

246	   Since this specification does not recommend one specific syntax,
247	   protocol specifications that use escapes MUST define the syntax they
248	   are using, including any necessary escapes to permit the escape
249	   sequence to be used literally.

251	   The application designer selecting a format should consider at least
252	   the following factors:

254	   o  If similar or related protocols already use one form, it may be
255	      best to select that form for consistency and predictability.

257	   o  A Unicode code point can fall in the range from U+0000 to
258	      U+10FFFF.  Different escape systems may use four, five, six, or
259	      eight hexadecimal digits.  To avoid clever syntax tricks and the
260	      consequent risk of confusion and errors, forms that use explicit
261	      string delimiters are generally preferred over other alternatives.
262	      In many contexts, symmetric paired delimiters are easier to
263	      recognize and understand than visually-unrelated ones.

265	   o  Syntax forms starting in "\u", without explicit delimiters, have
266	      been used in several different escape systems, including the four
267	      or eight digit syntax of C [ISO-C] (see Section 6.1), the UTF-16
268	      encoding of Java [Java] (see Section 6.3), and some arrangements
269	      that may follow the "\u" with four, five, or six digits.  The
270	      possible confusion about which option is actually being used may
271	      argue against use of any of these forms.

273	   o  Forms that require decoding surrogate pairs share most of the
274	      problems that appear with encoding of UTF-8 octets.  Internet
275	      protocols SHOULD NOT use surrogate pairs.
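
   The ambiguity described in the third item above can be demonstrated
   with a short, non-normative Python sketch.  The function name and
   its "digits" parameter are hypothetical; they stand in for whatever
   convention a reader assumes is in use.

```python
import re

def decode_undelimited(text, digits):
    """Decode backslash-u escapes, assuming exactly `digits` hex digits."""
    pattern = re.compile(r"\\u([0-9A-Fa-f]{%d})" % digits)
    return pattern.sub(lambda m: chr(int(m.group(1), 16)), text)

sample = "\\u1F600"  # the seven characters \ u 1 F 6 0 0

# Read as a four-digit form: U+1F60 followed by a literal "0".
as_four = decode_undelimited(sample, 4)
# Read as a five-digit form: the single character U+1F600.
as_five = decode_undelimited(sample, 5)

print(len(as_four), len(as_five))
```

   The same input yields two characters under one assumption and one
   character under the other, which is the confusion that explicit
   delimiters avoid.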

277	5.  Recommended Presentation Variants for Unicode Code Point Escapes

279	   There are a number of different ways to represent a Unicode code
280	   point position.  No one of them appears to be "best" for all
281	   contexts.  In addition, when an escape is needed for the escape
282	   mechanism itself, the optimal one of those might differ from one
283	   context to another.

285	   Some forms that are in popular use and that might reasonably be
286	   considered for use in a given protocol are described below and
287	   identified with a current-use context when feasible.  The two in this
288	   section are recommended for use in Internet Protocols.  Other popular
289	   ones appear in Section 6 with some discussion of their disadvantages.

291	5.1.  Backslash-U with Delimiters

293	   One of the recommended forms is a variation of the many forms that
294	   start in "\u" (see, e.g., Section 6.1 below), but uses explicit
295	   delimiters for the reasons discussed elsewhere.

297	   Specifically, in ABNF [RFC4234],

299	   EmbeddedUnicodeChar =  %x5C.75.27 4*6HEXDIG %x27
300	      ; starting with lower case "\u" and "'" and ending with "'".
301	      ; Note that the encodings are considered to be abstractions
302	      ; for the relevant characters, not designations of specific
303	      ; octets.

305	   HEXDIG =  "0" / "1" / "2" / "3" / "4" / "5" / "6" / "7" / "8" / "9" /
306	      "A" / "B" / "C" / "D" / "E" / "F"
307	      ; effectively identical with definition in RFC 4234.

309	   Protocol designers of applications using this form should specify a
310	   way to escape the introducing backslash ("\") if needed. "\\" is one
311	   obvious possibility, but not the only one.
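
   A possible encoder and decoder for this form is sketched below in
   Python.  It is illustrative only: the function names are
   hypothetical, "\\" is assumed as the escape for a literal
   backslash, and only characters outside the ASCII range are escaped.
   Because ABNF literal strings are case-insensitive, the decoder
   accepts lower-case hexadecimal digits as well.

```python
import re

# Either an escaped backslash, or \u' + four to six hex digits + '.
_TOKEN = re.compile(r"\\\\|\\u'([0-9A-Fa-f]{4,6})'")

def escape(text):
    """Escape backslashes and every non-ASCII character in text."""
    out = []
    for ch in text:
        if ch == "\\":
            out.append("\\\\")  # assumed convention for a literal "\"
        elif ord(ch) > 0x7F:
            out.append("\\u'%04X'" % ord(ch))
        else:
            out.append(ch)
    return "".join(out)

def unescape(text):
    """Reverse escape(); values above U+10FFFF are rejected."""
    def replace(match):
        if match.group(1) is None:
            return "\\"  # the token was an escaped backslash
        code_point = int(match.group(1), 16)
        if code_point > 0x10FFFF:
            raise ValueError("code point out of range")
        return chr(code_point)
    return _TOKEN.sub(replace, text)
```

   Because the decoder consumes "\\" as a single token, text that
   merely looks like an escape survives a round trip unchanged.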

313	5.2.  XML and HTML

315	   The other recommended form is the one used in XML.  It uses the form
316	   "&#xNNNN;".  Like the Perl form (Section 6.2), this form has a clear
317	   ending delimiter, reducing ambiguity.  HTML uses a similar form, but
318	   the semicolon may be omitted in some cases.  If that is done, the
319	   advantages of the delimiter disappear so the HTML form without
320	   the semicolon SHOULD NOT be used.  However, this format is often
321	   considered ugly and awkward outside of its native HTML, XML, and
322	   similar contexts.

324	   In ABNF:

326	   EmbeddedUnicodeChar =   %x26.23.78 2*6HEXDIG %x3B
327	      ; starts with "&#x" and ends with ";"

329	   Note that a literal "&" can be expressed by "&#x26;" when using this
330	   style.
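
   A non-normative Python sketch of an encoder and decoder for this
   form follows.  It tracks the ABNF above (two to six hexadecimal
   digits), uses hypothetical function names, and assumes that only
   "&" and characters outside the ASCII range need to be escaped.

```python
import re

# "&#x", two to six hexadecimal digits, then the closing ";".
_XML_REF = re.compile(r"&#x([0-9A-Fa-f]{2,6});")

def xml_escape(text):
    """Escape "&" and every non-ASCII character as &#xNN...;"""
    out = []
    for ch in text:
        if ch == "&" or ord(ch) > 0x7F:
            out.append("&#x%02X;" % ord(ch))
        else:
            out.append(ch)
    return "".join(out)

def xml_unescape(text):
    """Reverse xml_escape(); values above U+10FFFF are rejected."""
    def replace(match):
        code_point = int(match.group(1), 16)
        if code_point > 0x10FFFF:
            raise ValueError("code point out of range")
        return chr(code_point)
    return _XML_REF.sub(replace, text)
```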

332	6.  Forms that are Normally Not Recommended

334	6.1.  The C Programming Language: Backslash-U

336	   The forms

338	      \UNNNNNNNN (for any Unicode character) and

340	      \uNNNN (for Unicode characters in plane 0)

342	   are utilized in the C Programming Language [ISO-C] when an ASCII
343	   escape for embedded Unicode characters is needed.

345	   This form has disadvantages that may be significant.
346	   First, the use of a case variation (between "u" for the four digit
347	   form and "U" for the eight digit form) may not seem natural in
348	   environments in which upper and lower case characters are generally
349	   considered equivalent and might be confusing to people who are not
350	   very familiar with Latin-based alphabets (although those people might
351	   have even more trouble reading relevant English text and
352	   explanations).  Second, as discussed in Section 4, the very fact that
353	   there are several different conventions that start in \u or \U may
354	   become a source of confusion as people make incorrect assumptions
355	   about what they are looking at.

357	6.2.  Perl: A Hexadecimal String

359	   Perl uses the form \x{NNNN...}.  The advantage of this form is that
360	   there are explicit delimiters, resolving the issue of having
361	   variable-length strings or using the case-change mechanism of the
362	   C form (Section 6.1) to distinguish Plane 0 from more general forms.
363	   Some other programming languages would tend to favor X'NNNN...' forms
364	   for hexadecimal strings and perhaps U'NNNN...' for Unicode-specific
365	   strings, but those forms do not seem to be in use around the IETF.

367	   Note that there is a possible ambiguity in how two-character or low-
368	   numbered sequences in this notation are understood, i.e., that octets
369	   in the range \x{00} through \x{FF} may be construed as being in the
370	   local character set, not as Unicode code points.  Because of this
371	   apparent ambiguity, and because IETF documents do not contain
372	   provision for pragmas (see [PERLUniIntro] for more information about
373	   the "encoding" pragma in Perl and other details) the Perl forms
374	   should be used with extreme caution if at all.

376	6.3.  Java: Escaped UTF-16

378	   Java [Java] uses the form \uNNNN, but as a reference to UTF-16
379	   values, not Unicode code points.  While it uses a syntax similar to
380	   that described in Section 6.1, this relationship to UTF-16 makes it,
381	   in many respects, more similar to the encodings of UTF-8 discussed
382	   above than to an escape that designates Unicode code points.  Note
383	   that the UTF-16 form, and hence the Java escape notation, can
384	   represent characters outside Plane 0 (i.e., above U+FFFF) only by the
385	   use of surrogate pairs, raising some of the same issues as the use of
386	   UTF-8 octets discussed above.  For characters in Plane 0, the Java
387	   form is indistinguishable from the Plane 0-only form described in
388	   Section 6.1.  If only for that reason, it SHOULD NOT be used as an
389	   escape except in those Java contexts in which it is natural.
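
   To show where the surrogate pairs come from, the following
   non-normative Python sketch produces Java-style escapes.  The
   function name is hypothetical, and the sketch assumes that ASCII
   characters (including the backslash) are passed through unchanged.

```python
def java_escape(text):
    """Escape non-ASCII characters as UTF-16 code units, \\uNNNN each."""
    out = []
    for ch in text:
        code_point = ord(ch)
        if code_point <= 0x7F:
            out.append(ch)
        elif code_point <= 0xFFFF:
            # Plane 0: one code unit, equal to the code point itself.
            out.append("\\u%04X" % code_point)
        else:
            # Higher planes: split the offset into a surrogate pair.
            offset = code_point - 0x10000
            high = 0xD800 + (offset >> 10)
            low = 0xDC00 + (offset & 0x3FF)
            out.append("\\u%04X\\u%04X" % (high, low))
    return "".join(out)

print(java_escape(chr(0x1F600)))  # \uD83D\uDE00
```

   Neither of the two escapes in the output is the code point U+1F600,
   so a reader must recombine them before consulting the Unicode
   tables.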

391	7.  IANA Considerations

393	   This document specifies no actions for IANA.

395	8.  Security Considerations

397	   This document proposes a set of rules for encoding Unicode characters
398	   when other considerations do not apply.  Since all of the recommended
399	   encodings are unambiguous and normalization issues are not involved,
400	   it should not introduce any security issues that are not present as a
401	   result of simple use of non-ASCII characters, no matter how they are
402	   encoded.  The mechanisms suggested should slightly lower the risks of
403	   confusing users with encoded characters by making the identity of the
404	   characters being used somewhat more obvious than some of the
405	   alternatives.

407	   An escape mechanism such as the one specified in this document can
408	   allow characters to be represented in more than one way.  Where
409	   software interprets the escaped form, there is a risk that security
410	   checks, and any necessary checks for, e.g., minimal or normalized
411	   forms, are done at the wrong point.
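
   The risk of checking at the wrong point can be illustrated with a
   deliberately reduced Python sketch.  The policy function is_safe()
   and the sample input are hypothetical, and unescape() handles only
   the delimited form of Section 5.1.

```python
import re

_ESCAPE = re.compile(r"\\u'([0-9A-Fa-f]{4,6})'")

def unescape(text):
    """Replace each \\u'NNNN' escape with the character it designates."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

def is_safe(text):
    """Hypothetical policy: reject any text that contains "/"."""
    return "/" not in text

received = "..\\u'002F'etc"  # a "/" hidden behind an escape

# Checking before unescaping sees no "/" and wrongly accepts the input.
checked_too_early = is_safe(received)
# Checking the unescaped text sees "../etc" and rejects it.
checked_after = is_safe(unescape(received))

print(checked_too_early, checked_after)
```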

413	9.  Acknowledgments

415	   This document was produced in response to a series of discussions
416	   within the IETF Applications Area and as part of work on email
417	   internationalization and internationalized domain name updates.  It
418	   is a synthesis of a large number of discussions, the comments of the
419	   participants in which are gratefully acknowledged.  The help of Mark
420	   Davis in constructing a list of alternative presentations and
421	   selecting among them was especially important.

423	   Tim Bray, Peter Constable, Stephane Bortzmeyer, Chris Newman, Frank
424	   Ellermann, Clive D.W. Feather, Philip Guenther, Bjoern Hoehrmann,
425	   Simon Josefsson, Bill McQuillan, der Mouse, Phil Pennock, and Julian
426	   Reschke provided careful reading and some corrections and suggestions
427	   on the various drafts.  Taken together, their suggestions motivated
428	   the significant revision of this document and its recommendations
429	   between version -00 and version -01 and further improvements in the
430	   subsequent versions.

432	10.  Change log

434	   [[anchor9: RFC Editor: Please remove this section before
435	   publication.]]

437	10.1.  Changes in -01

439	   o  Corrected ABNF syntax for Hex-quad and Full-form.

441	10.2.  Major Changes in -02

443	   This version removes the recommendation of a particular format,
444	   discussing several of them and indicating considerations in making a
445	   choice.

447	10.3.  Major Changes in -03

449	   This version improves the ABNF and adds it for more of the escape
450	   techniques.  It also contains several editorial and contextual
451	   changes.

453	10.4.  Major Changes in -04

455	   o  Updated this section to reflect the changes in -02 and -03.

457	   o  Modified the structure of the document to explicitly recommend the
458	      "\u'[N[N]]NNNN'" and XML forms (still trying to make a
459	      recommendation, not just a list).

461	   o  Clarified the description of the Perl form, added a reference, and
462	      warned about the ambiguity with single octets.

464	   o  Some additional editorial changes for clarity.

466	10.5.  Changes in -05

468	   Moved syntax for the "not recommended" forms to an appendix.

470	10.6.  Changes in -06

472	   o  Added syntax for Java to appendix, per Clive Feather.

474	   o  Added discussion of escapes for \ in the backslash-U case
475	      (Section 5.1) per Frank Ellermann.

477	   o  Moved the definition of HEXDIG from the appendix to the normative
478	      Section 5.1.

480	   o  Small editorial and layout corrections.

482	10.7.  Changes in -07

484	   Version 06 was the IETF Last Call version.  This version reflects
485	   changes made as the result of comments made during and after Last
486	   Call.

488	   o  Changed reference name for the Perl document to conform to
489	      conventional Perl usage.

491	   o  Changed terminology in Section 2 to better align with Unicode
492	      Standard terminology.

494	Appendix A.  Formal Syntax for Forms Not Recommended

496	   While the syntax for the escape forms that are not recommended above
497	   (see Section 6) is not given inline, in the hope of discouraging
498	   their use, it is provided in this appendix in the hope that those
499	   who choose to use them will do so consistently.  The reader is
500	   cautioned that some of these forms are not defined precisely in the
501	   original specifications and that others have evolved over time in
502	   ways that are not precisely consistent.  Consequently, these
503	   definitions are not normative and may not even precisely match
504	   reasonable interpretations of their sources.

506	   The definition of "HEXDIG" for the forms that follow appears in
507	   Section 5.1.

509	Appendix A.1.  The C Programming Language Form

511	   Specifically, in ABNF [RFC4234],
512	   EmbeddedUnicodeChar =  BMP-form / Full-form

514	   BMP-form =  %x5C.75 4HEXDIG ; starting with lower case "\u"
515	      ; The encodings are considered to be abstractions for the
516	      ; relevant characters, not designations of specific octets.

518	   Full-form =  %x5C.55 8HEXDIG ; starting with upper case "\U"

520	Appendix A.2.  Perl Form

522	   EmbeddedUnicodeChar =   %x5C.78 "{" 2*6HEXDIG "}" ; starts with "\x"

524	Appendix A.3.  Java Form

526	   EmbeddedUnicodeChar =   %x5C.75 4HEXDIG ; starts with "\u"

528	11.  References

530	11.1.  Normative References

532	   [ISO10646]
533	              International Organization for Standardization,
534	              "Information Technology - Universal Multiple-Octet Coded
535	              Character Set (UCS)", ISO/IEC 10646:2003, December 2003.

537	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
538	              Requirement Levels", BCP 14, RFC 2119, March 1997.

540	   [RFC3629]  Yergeau, F., "UTF-8, a transformation format of ISO
541	              10646", STD 63, RFC 3629, November 2003.

543	   [RFC4234]  Crocker, D., Ed. and P. Overell, "Augmented BNF for Syntax
544	              Specifications: ABNF", RFC 4234, October 2005.

546	   [Unicode]  The Unicode Consortium, "The Unicode Standard, Version
547	              5.0", 2006.

549	              (Addison-Wesley, 2006.  ISBN 0-321-48091-0).

551	11.2.  Informative References

553	   [ASCII]    American National Standards Institute (formerly United
554	              States of America Standards Institute), "USA Code for
555	              Information Interchange", ANSI X3.4-1968, 1968.

557	              ANSI X3.4-1968 has been replaced by newer versions with
558	              slight modifications, but the 1968 version remains
559	              definitive for the Internet.

561	   [ISO-C]    International Organization for Standardization,
562	              "Information technology --  Programming languages -- C",
563	              ISO/IEC 9899:1999, 1999.

565	   [Java]     Sun Microsystems, Inc., "Java Language Specification,
566	              Third Edition", 2005, <http://java.sun.com/docs/books/jls/
567	              third_edition/html/lexical.html#95413p>.

569	   [PERLUniIntro]
570	              Hietaniemi, J., "perluniintro", Perl documentation  5.8.8,
571	              2002, <http://perldoc.perl.org/perluniintro.html>.

573	   [RFC2277]  Alvestrand, H., "IETF Policy on Character Sets and
574	              Languages", BCP 18, RFC 2277, January 1998.

576	   [RFC3492]  Costello, A., "Punycode: A Bootstring encoding of Unicode
577	              for Internationalized Domain Names in Applications
578	              (IDNA)", RFC 3492, March 2003.

580	   [UnicodeGlossary]
581	              The Unicode Consortium, "Glossary of Unicode Terms",
582	              June 2007, <http://www.unicode.org/glossary>.

584	   [W3C-CharMod]
585	              Duerst, M., "Character Model for the World Wide Web 1.0",
586	              W3C Recommendation, February 2005,
587	              <http://www.w3.org/TR/charmod/>.

589	Author's Address

591	   John C Klensin
592	   1770 Massachusetts Ave, #322
593	   Cambridge, MA  02140
594	   USA

596	   Phone: +1 617 245 1457
597	   Email: john-ietf@jck.com

599	Full Copyright Statement

601	   Copyright (C) The IETF Trust (2007).

603	   This document is subject to the rights, licenses and restrictions
604	   contained in BCP 78, and except as set forth therein, the authors
605	   retain all their rights.

607	   This document and the information contained herein are provided on an
608	   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
609	   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
610	   THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
611	   OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
612	   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
613	   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

615	Intellectual Property

617	   The IETF takes no position regarding the validity or scope of any
618	   Intellectual Property Rights or other rights that might be claimed to
619	   pertain to the implementation or use of the technology described in
620	   this document or the extent to which any license under such rights
621	   might or might not be available; nor does it represent that it has
622	   made any independent effort to identify any such rights.  Information
623	   on the procedures with respect to rights in RFC documents can be
624	   found in BCP 78 and BCP 79.

626	   Copies of IPR disclosures made to the IETF Secretariat and any
627	   assurances of licenses to be made available, or the result of an
628	   attempt made to obtain a general license or permission for the use of
629	   such proprietary rights by implementers or users of this
630	   specification can be obtained from the IETF on-line IPR repository at
631	   http://www.ietf.org/ipr.

633	   The IETF invites any interested party to bring to its attention any
634	   copyrights, patents or patent applications, or other proprietary
635	   rights that may cover technology that may be required to implement
636	   this standard.  Please address the information to the IETF at
637	   ietf-ipr@ietf.org.

639	Acknowledgment

641	   Funding for the RFC Editor function is provided by the IETF
642	   Administrative Support Activity (IASA).