idnits 2.17.1 

draft-weider-iab-char-wrkshop-00.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Cannot find the required boilerplate sections (Copyright, IPR, etc.) in
     this document.

     Expected boilerplate is as follows today (2024-04-26) according to
     https://trustee.ietf.org/license-info :

     IETF Trust Legal Provisions of 28-dec-2009, Section 6.a:
        This Internet-Draft is submitted in full conformance with the provisions
        of BCP 78 and BCP 79.

     IETF Trust Legal Provisions of 28-dec-2009, Section 6.b(i), paragraph 2:
        Copyright (c) 2024 IETF Trust and the persons identified as the document
        authors.  All rights reserved.

     IETF Trust Legal Provisions of 28-dec-2009, Section 6.b(i), paragraph 3:
        This document is subject to BCP 78 and the IETF Trust's Legal Provisions
        Relating to IETF Documents
        (https://trustee.ietf.org/license-info) in effect on the date of
        publication of this document.  Please review these documents
        carefully, as they describe your rights and restrictions with
        respect to this document.  Code Components extracted from this
        document must include Simplified BSD License text as described in
        Section 4.e of the Trust Legal Provisions and are provided
        without warranty as described in the Simplified BSD License.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  ** Missing expiration date.  The document expiration date should appear on
     the first and last page.

  ** The document seems to lack a 1id_guidelines paragraph about
     Internet-Drafts being working documents. 

  ** The document seems to lack a 1id_guidelines paragraph about 6 months
     document validity. 

  ** The document seems to lack a 1id_guidelines paragraph about the list of
     current Internet-Drafts. 

  ** The document seems to lack a 1id_guidelines paragraph about the list of
     Shadow Directories. 

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard

  == The page length should not exceed 58 lines per page, but there was 1
     longer page, the longest (page 1) being 59 lines

  == It seems as if not all pages are separated by form feeds - found 0 form
     feeds but 27 pages


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack a Security Considerations section.

  ** The document seems to lack an IANA Considerations section.  (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  ** The document seems to lack an Authors' Addresses Section.

  ** The abstract seems to contain references ([ISO-8859], [ISO-10646],
     [RFC1958], [ISO-7498], [ASCII], [RFC-1766], [SMTP], [ISO-2022], [UTF-7],
     [MIME], [UTF-8], [POSIX], [HTTP], [HTML], [Base64], [IANA]), which it
     shouldn't.  Please replace those with straight textual mentions of the
     documents in question.

  ** The document seems to lack a both a reference to RFC 2119 and the
     recommended RFC 2119 boilerplate, even if it appears to use RFC 2119
     keywords. 

     RFC 2119 keyword, line 178: '...otocol machinery SHOULD NOT be changed...'
     RFC 2119 keyword, line 205: '...UTF-7 [UTF-7] MUST be available....'
     RFC 2119 keyword, line 208: '...lications; protocols SHOULD attempt to...'
     RFC 2119 keyword, line 220: '...vital, and MUST be supported....'
     RFC 2119 keyword, line 522: '...ry for decoding, stored text SHOULD be...'
     (8 more instances...)


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == Line 81 has weird spacing: '...tion of  this...'

  == Line 138 has weird spacing: '...certain  types...'

  == Line 144 has weird spacing: '...llowing  issue...'

  == Line 395 has weird spacing: '...ith the  excep...'

  == Line 474 has weird spacing: '... in the  proto...'

  == (12 more instances...)

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (15 October 1996) is 10055 days in the past.  Is this
     intentional?

  -- Found something which looks like a code comment -- if you have code
     sections in the document, please surround them with '<CODE BEGINS>' and
     '<CODE ENDS>' lines.


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Missing reference section? 'MIME' on line 1082 looks like a reference

  -- Missing reference section? 'POSIX' on line 1087 looks like a reference

  -- Missing reference section? 'SMTP' on line 1117 looks like a reference

  -- Missing reference section? 'ASCII' on line 1040 looks like a reference

  -- Missing reference section? 'RFC 1958' on line 1111 looks like a reference

  -- Missing reference section? 'UTF-8' on line 1127 looks like a reference

  -- Missing reference section? 'UTF-7' on line 1123 looks like a reference

  -- Missing reference section? 'HTML' on line 1049 looks like a reference

  -- Missing reference section? 'ISO-7498' on line 1064 looks like a reference

  -- Missing reference section? 'RFC-1766' on line 310 looks like a reference

  -- Missing reference section? 'ISO-10646' on line 1078 looks like a
     reference

  -- Missing reference section? 'ISO-8859' on line 1067 looks like a reference

  -- Missing reference section? 'ISO-2022' on line 1061 looks like a reference

  -- Missing reference section? 'Base64' on line 1043 looks like a reference

  -- Missing reference section? 'IANA' on line 1097 looks like a reference

  -- Missing reference section? 'HTTP' on line 1052 looks like a reference

  -- Missing reference section? 'SGML' on line 1114 looks like a reference

  -- Missing reference section? 'CEN' on line 1047 looks like a reference

  -- Missing reference section? 'RFC-1345' on line 609 looks like a reference

  -- Missing reference section? 'RFC-1554' on line 1102 looks like a reference

  -- Missing reference section? 'I18N' on line 1055 looks like a reference

  -- Missing reference section? 'RFC 1345' on line 1099 looks like a reference

  -- Missing reference section? 'RFC 1766' on line 1108 looks like a reference

  -- Missing reference section? 'Unicode' on line 1120 looks like a reference


     Summary: 11 errors (**), 0 flaws (~~), 9 warnings (==), 27 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

1	The Report of the IAB Character Set Workshop
2	held 29 February - 1 March, 1996
3	INTERNET-DRAFT - version 3.3 15 October 1996
4	<draft-weider-iab-char-wrkshop-00.txt>
5	Expires in six months

7	Chris Weider,  Chair
8	Cecilia Preston, Preston & Lynch
9	Keld Simonsen, DKUUG
10	Harald Alvestrand, UNINETT
11	Ran Atkinson, Cisco Systems
12	Mark Crispin, University of Washington
13	Peter Svanberg, KTH

15	Acknowledgments

17	The authors would like to sincerely thank the Information Sciences
18	Institute (ISI), and in particular Joyce Reynolds, for graciously
19	hosting this event; Joe Kemp and Jeanine Yamazaki of ISI made sure the
20	facilities met our needs.  We also wish to thank the Internet Society,
21	which underwrote travel for participants who might not otherwise have
22	been able to attend.  Of course, we also wish to thank the many
23	experts who participated in the workshop and on the mailing list; a
24	complete list of these people can be found in Appendix D.  Bunyip
25	Information Systems was kind enough to provide mailing list facilities
26	for this work.

28	Table of Contents

30	Abstract
31	0:    Executive summary
32	1:    Introduction
33	2:    Character sets on the Internet -- the problem today
34	2.1:  Character set handling in existing protocols
35	3:    The model
36	3.1:  Components of the model
37	3.2:  Recommended defaults
38	3.3:  Guidelines for conversions between coded character sets
39	4:    Presentation issues
40	5:    Open issues
41	6:    Security considerations
42	7:    Conclusions
43	8:    Recommendations
44	8.1:  To the IAB
45	8.2:  For new Internet protocols
46	8.3:  For registration of new character sets

48	Appendix A: List of protocols affected by character set issues
49	Appendix B: Acronyms
50	Appendix C: Glossary
51	Appendix D: References
52	Appendix E: Recommended reading
53	Appendix F: Workshop attendee list
54	Appendix G: Author's addresses
55	Abstract

57	This report details the conclusions of an IAB-sponsored invitational
58	workshop held 29 February - 1 March, 1996, to discuss the use of
59	character sets on the Internet.  It motivates the need to have
60	character set handling in Internet protocols which transmit text,
61	provides a conceptual framework for specifying character sets,
62	recommends the use of MIME tagging for transmitted text, recommends a
63	default character set *without* stating that there is no need for
64	other character sets, and makes a series of recommendations to the
65	IAB, IANA, and the IESG for furthering the integration of the
66	character set framework into text transmission protocols.

68	0: Executive summary

70	The term 'Character Set' means many things to many people. Even the
71	MIME registry of character sets registers items that have great
72	differences in semantics and applicability. This workshop provides
73	guidance to the IAB and IETF about the use of character sets on the
74	Internet and provides a common framework for interoperability between
75	the many character sets in use there.

77	The framework consists of four components: an architecture model, which
78	specifies components necessary for on-the-wire transmission of text;
79	recommendations for tagging transmitted (and stored) text; recommended
80	defaults for each level of the model; and a set of recommendations to
81	the IAB, IANA, and the IESG for furthering the integration of  this
82	framework into text transmission protocols.

84	The architectural model specifies 7 layers, of which only three are
85	required for on-the-wire transmission. The Coded Character Set is a
86	mapping from a set of abstract characters to a set of integers. The
87	Character Encoding Scheme is a mapping from a Coded Character Set (or
88	several) to a set of octets. The Transfer Encoding Syntax is a
89	transformation applied to data which has been encoded using a
90	Character Encoding Scheme to allow it to be transmitted. These layers
91	should be specified in a transmitted text stream by using the MIME
92	encoding mechanisms.

94	This report recommends the use of ISO 10646 as the default Coded
95	Character Set, and UTF-8 as the default Character Encoding Scheme in
96	the creation of new protocols or new versions of old protocols which
97	transmit text. These defaults do not deprecate the use of other
98	character sets when and where they are needed; they are simply
99	intended to provide guidance and a specification for interoperability.

101	1:  Introduction

103	This is the report of an IAB-sponsored invitational workshop on the
104	use of Character Sets on the Internet, held 29 February - 1 March 1996
105	at the Information Sciences Institute (ISI) in Marina del Rey, California.
106	In addition, this report covers the discussion on the mailing list up
107	to and slightly beyond the workshop itself.  The goals of this
108	workshop were to provide guidance to the IAB and the IETF about the
109	use of character sets on the Internet, and if possible a common
110	framework for interoperability between the many character sets in use
111	there.  Both goals were achieved.

113	2:  Character sets on the Internet - the problem

115	The term 'character set' is typically applied to the contents of a
116	wide variety of text transmission and display protocols used on the
117	Internet.  Because the term is used to mean different things,
118	confusion has arisen.  For example, the MIME registry of character
119	sets [MIME] contains items that may differ greatly in their
120	applicability and semantics in various Internet protocols.

122	In addition, there is a vast profusion of different text encoding
123	schemes in use on the Internet.  This per se is not a problem; each
124	scheme has evolved to meet real needs.  However, information
125	applications such as mail, directories, and the World Wide Web have
126	each developed different techniques for dealing with the growing number
127	of schemes.  A robust information architecture for the Internet
128	requires as much interoperability between these techniques as possible.

130	2.1:  Related topics deemed out of scope for this workshop

132	Successful display of plain text transmitted over the Internet requires
133	a lot of information about the text itself, such as the underlying
134	character set, language, and so forth.  An additional set of formatting
135	information is needed if the receiving application wishes to use local
136	(cultural) conventions when it presents the data to the user.  This
137	formatting information provides the data necessary to
138	format certain types of textual data (dates, times, numbers and
139	monetary notation) into a form which is familiar to the user.  The POSIX
140	[POSIX] notion of locale encompasses language, coded character set and
141	cultural conventions.

143	To avoid unfruitful discussion, and to make the best use of the time
144	available for the workshop, we declared the following  issues out of
145	scope for the purposes of this workshop:

147	-  glyphs
148	-  sorting
149	-  culture (e.g. do we present the American or British spelling?)
150	-  user interface issues
151	-  internal representation of textual data
152	-  included characters (why aren't certain characters available in
153	       any character set?)
154	-  locale (in the POSIX sense)
155	-  font registration
156	-  semantics
157	-  user input/output issues
158	-  Han unification issues
159	There are some related issues which were included for discussion, most
160	importantly the 'locale' components necessary for transport and
161	identification of multilingual texts.

163	2.2:  Character Set handling in existing protocols

165	One of the group's overriding concerns was that the framework
166	developed for character set handling not break existing protocols.
167	With that in mind, the way character sets are being used in existing
168	protocols was examined.  See Appendix A for a list of those protocols
169	and some recommendations for change.

171	2.2.1:  General comments

173	The problem areas here fall into three main categories: protocols,
174	identifiers, and data.

176	2.2.1.1:  Protocols

178	The protocol machinery SHOULD NOT be changed; allowing, for instance,
179	SMTP [SMTP] to use both MAIL FROM and POST FRA is dangerous to the
180	protocols' stability.  However, many protocols carry error messages
181	and other information that is intended for human consumption; it MIGHT
182	be an advantage to allow these to be localized into a specific
183	language and character set, rather than staying in English and
184	US-ASCII [ASCII].  If this is done, new extensions should follow the
185	framework outlined below.

187	2.2.1.2:  Identifiers

189	There is a strong statement of direction from the IAB, RFC 1958
190	[RFC 1958],  which states:

192	     4.3 Public (i.e. widely visible) names should be in case
193	         independent ASCII.  Specifically, this refers to DNS names,
194	         and to protocol elements that are transmitted in text format.
195	         ...
196	     5.4 Designs should be fully international, with support for
197	         localization (adaptation to local character sets). In
198	         particular, there should be a uniform approach to character
199	         set tagging for information content.

201	In protocols that up to now have used US-ASCII only, UTF-8 [UTF-8]
202	forms a simple upgrade path; however, its use should be negotiated
203	either by negotiating a protocol version or by negotiating charset
204	usage, and a fallback to a US-ASCII compatible representation such as
205	UTF-7 [UTF-7] MUST be available.
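
As an illustration only, the upgrade path described above can be
sketched in Python.  The sketch assumes Python's built-in "utf-8" and
"utf-7" codecs; the function names and the peer_supports_utf8 flag are
hypothetical stand-ins for whatever negotiation a given protocol
defines, and are not taken from any existing protocol.

```python
def is_seven_bit(data: bytes) -> bool:
    """True if every octet fits in 7 bits, i.e. is US-ASCII compatible."""
    return all(octet < 0x80 for octet in data)


def encode_identifier(name: str, peer_supports_utf8: bool) -> bytes:
    """Use UTF-8 when it has been negotiated, otherwise fall back to UTF-7.

    peer_supports_utf8 is a hypothetical outcome of protocol negotiation.
    """
    return name.encode("utf-8" if peer_supports_utf8 else "utf-7")


# Pure US-ASCII identifiers are unchanged by UTF-8.
assert "INBOX".encode("utf-8") == "INBOX".encode("ascii")

# A non-ASCII identifier needs 8-bit octets in UTF-8 ...
name = "Räksmörgås"
assert not is_seven_bit(encode_identifier(name, peer_supports_utf8=True))

# ... but stays within 7 bits in the UTF-7 fallback and round-trips.
fallback = encode_identifier(name, peer_supports_utf8=False)
assert is_seven_bit(fallback)
assert fallback.decode("utf-7") == name
```

The point of the sketch is only that existing US-ASCII traffic is
untouched by the upgrade, while the fallback form can still cross a
7-bit path without loss.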

207	The need for passing application data such as language on individual
208	identifiers varies between applications; protocols SHOULD attempt to
209	evaluate this need when designing mechanisms.  Applying the ASCII
210	requirement for identifiers that are only used in a local context
211	(such as private mailbox folder names) is both unrealistic and
212	unreasonable; in such cases, methods for consistency in the handling
213	of character set should be considered.

215	2.2.1.3:  Data

217	Data that requires character set handling includes text, databases,
218	and HTML [HTML] pages, for example.  In these the support for multiple
219	character sets and proper application information is absolutely
220	vital, and MUST be supported.

222	2.3:  Architectural requirements

224	To address the issues enumerated for this work, first an architectural
225	model was created which establishes the components that are required
226	to fully specify the transmission of textual data. Many of these
227	components are already familiar to the users of encoding protocols
228	such as MIME.  Not all of these are discussed in detail in this
229	report; we restrict ourselves primarily to those components which are
230	required to specify the 'on-the-wire' phase of text transmission.

232	Mandating a single, all-encompassing character set would not fit well
233	with the IETF philosophy of planning for architectural diversity.  So,
234	the best that can be done is to provide a common *framework* for
235	identifying and using the multitude of character sets available on the
236	Internet.  It would be an advantage if the total number of Coded
237	Character Sets could be kept to a minimum.  This framework should meet
238	the following requirements:

240	-  it should not break existing protocols (because then the likelihood
241	     of deployment is very small),
242	-  it should allow the use of character sets currently used on the
243	     Internet, and
244	-  it should be relatively easy to build into new protocols.

246	3:  Architectural model

248	The basic architectural model which guided our discussions is shown
249	below.  A distinction was made between those segments which were
250	necessary to successfully transmit character set data on-the-wire and
251	those needed to present that data to a user in a comprehensible manner.
252	The discussions were primarily restricted to those segments of the model
253	which specify the 'on-the-wire' transmission of textual data.

255	User interface issues: these are briefly discussed in Section 3.1.1.
256	     Layout
257	     Culture
258	     Locale
259	     Language
260	On-the-wire: see Section 3.2 for detailed discussion.
261	     Transfer Encoding Syntax
262	     Character Encoding Scheme
263	     Coded Character Set

265	3.1:  Segments defined

267	3.1.1:  User interface

269	3.1.1.1:  Layout

271	Layout includes the elements needed for displaying text to the user,
272	such as font selection, word-wrapping, etc.  It is similar to the
273	'presentation' layer in the 7-layer ISO telecommunications model
274	[ISO-7498].

276	3.1.1.2:  Culture

278	Culture includes information about cultural preferences, which affect
279	spelling, word choice, and so forth.

281	3.1.1.3:  Locale

283	The locale component includes the information necessary to make choices
284	about text manipulation which will present the text to the user in an
285	expected format.  This information may include the display of date, time
286	and monetary symbol preferences.  Notice that locale modifications are
287	typically applied to a text stream before it is presented to the user,
288	although they also are used to specify input formats.

290	3.1.1.4:  Language

292	This component specifies the language of the transmitted text.  At
293	times and in specific cases, language information may be required to
294	achieve a particular level of quality for the purpose of displaying a
295	text stream.  For example, UTF-8 encoded Han may require transmission
296	of a language tag to select the specific glyphs to be displayed at a
297	particular level of quality.

299	Note that information other than language may be used to achieve the
300	required level of quality in a display process.  In particular, a font
301	tag is sufficient to produce identical results.  However, the
302	association of a language with a specific block of text has usefulness
303	far beyond its use in display.  In particular, as the amount of
304	information available in multiple languages on the World Wide Web
305	grows, it becomes critical to specify which language is in use in
306	particular documents, to assist automatic indexing and retrieval of
307	relevant documents.

309	The term 'language tag' should be reserved for the short identifier of
310	RFC 1766 [RFC-1766] that only serves to identify the language.  While
311	there may be other text attributes intimately associated with the
312	language of the document, such as desired font or text direction,
313	these should be specified with other identifiers rather than
314	overloading the language tag.

316	3.2:  On the wire

318	There are three segments of the model which are required for
319	completely specifying the content of a transmitted text stream (with
320	the occasional exception of the Language component, mentioned above).
321	These components are:

323	1)  Coded Character Set,
324	2)  Character Encoding Scheme, and
325	3)  Transfer Encoding Syntax.

327	Each of these abstract components must be explicitly specified by the
328	transmitter when the data is sent.  There may be instances of an
329	implicit specification due to the protocol/standard being used (e.g.
330	ANSI/NISO Z39.50).  Also, in MIME, the Coded Character Set and Character
331	Encoding Scheme are specified by the Charset parameter to the
332	Content-Type header field, and Transfer Encoding Syntax is specified by
333	the Content-Transfer-Encoding header field.
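
As a hedged illustration of the MIME labelling just described, the
following sketch uses the email package from the Python standard
library to attach and read back the two header fields.  The helper
names label_text and read_labels are hypothetical, and the choice of
UTF-8 with Base64 is only an example.

```python
from email.message import EmailMessage


def label_text(text: str) -> EmailMessage:
    """Wrap text in a MIME entity that names its charset and its TES."""
    msg = EmailMessage()
    # charset labels the CCS/CES pair; cte labels the Transfer Encoding Syntax.
    msg.set_content(text, charset="utf-8", cte="base64")
    return msg


def read_labels(msg: EmailMessage) -> tuple:
    """Return (charset, transfer encoding) as declared in the headers."""
    return msg.get_content_charset(), str(msg["Content-Transfer-Encoding"])


msg = label_text("Blåbærsyltetøy")
charset, transfer_encoding = read_labels(msg)

# The receiver decodes using the declared labels rather than guessing.
decoded = msg.get_content().rstrip("\n")
```

Because both labels travel with the text, the receiver never has to
infer the encoding from the origin or the content of the message.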

335	3.2.1:  Coded Character Set

337	A Coded Character Set (CCS) is a mapping from a set of abstract
338	characters to a set of integers.  Examples of coded character sets are
339	ISO 10646 [ISO-10646], US-ASCII [ASCII], and ISO-8859 series
340	[ISO-8859].

342	3.2.2:  Character Encoding Scheme

344	A Character Encoding Scheme (CES) is a mapping from a Coded Character
345	Set or several coded character sets to a set of octets. Examples of
346	Character Encoding Schemes are ISO-2022 [ISO-2022] and UTF-8 [UTF-8].

348	3.2.3:  Transfer Encoding Syntax

350	It is frequently necessary to transform encoded text into a format
351	which is transmissible by specific protocols.  The Transfer Encoding
352	Syntax (TES) is a transformation applied to character data
353	encoded using a CCS and possibly a CES to allow it to be transmitted.
354	Examples of Transfer Encoding Syntaxes are Base64 Encoding [Base64],
355	gzip encoding, and so forth.
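
The three on-the-wire components can be traced for a single character
in a short Python sketch.  This is an illustration under stated
assumptions, not a normative procedure: it picks ISO 10646 code
positions (via Python's ord), UTF-8, and Base64 as the CCS, CES, and
TES respectively, and the function name decode_layers is hypothetical.

```python
import base64

text = "Å"  # LATIN CAPITAL LETTER A WITH RING ABOVE

# Coded Character Set: abstract character -> integer.
code_points = [ord(ch) for ch in text]

# Character Encoding Scheme: integers -> octets.
octets = text.encode("utf-8")

# Transfer Encoding Syntax: octets -> a form the transport can carry.
transfer_form = base64.b64encode(octets)


def decode_layers(data: bytes) -> str:
    """Undo the layers in reverse order: TES first, then CES."""
    return base64.b64decode(data).decode("utf-8")


print(code_points)                   # [197]
print(octets)                        # b'\xc3\x85'
print(transfer_form)                 # b'w4U='
print(decode_layers(transfer_form))  # Å
```

Each step changes only the representation; the abstract character is
the same at every layer, which is why the receiver must know all three
labels to recover it.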

357	3.3:  Determining which values of CCS, CES, and TES are used

359	To completely specify which CCS, CES, and TES are used in a specific
360	text transmission, there needs to be a consistent set of labels for
361	identifying each of them.  Once the appropriate mechanisms have been
362	selected, there are six techniques by which these labels can be
363	associated with the data.

365	The labels themselves are named and registered, either with IANA
366	[IANA] or with some other registry.  Ideally, their definitions are
367	retrievable from some registration authority.

369	Labels may be determined in one of the following ways:

371	-  Determined by guessing, where the receiver of the text has to
372	   guess the values of the CCS, CES, and TES. For example: "I got
373	   this from Sweden so it's probably ISO-8859-1."  This is
374	   obviously not a reliable way to decode text.
375	-  Determined by the standard, where the protocol used to transmit
376	   the data has made documented choices of CCS, CES, and TES in the
377	   standard. Thus, the encodings used are known through the
378	   access protocol; for example, HTTP [HTTP] uses (but is not
379	   limited to) ISO-8859-1, and SMTP uses US-ASCII.
380	-  Attached to the transfer envelope, where the descriptive labels are
381	   attached to the wrapper placed around the text for transport.
382	   MIME headers are a good example of this technique.
383	-  Included in the data stream, where the data stream itself has
384	   been encoded in such a way as to signal the character set used.
385	   For example, ISO-2022 encodes the data with escape sequences to
386	   provide information on the character subset currently being used.
387	-  Agreed by prior bilateral agreement, where some out-of-band
388	   negotiation has allowed the text transmitter and receiver to
389	   determine the CCS, CES, and TES for the transmitted text.
390	-  Agreed to by negotiation during some phase of the protocol,
391	   typically initialization.
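
To make the difference between these techniques concrete, a receiver
might combine two of them as in the following Python sketch.  The
order of preference (envelope label first, then the default documented
by the standard, and never a guess) is an assumption made for this
illustration, as are the names PROTOCOL_DEFAULTS and resolve_charset.
The two default values are the ones mentioned in the list above.

```python
import codecs
from typing import Optional

# Defaults "determined by the standard", as cited in the list above.
PROTOCOL_DEFAULTS = {"smtp": "us-ascii", "http": "iso-8859-1"}


def resolve_charset(protocol: str, envelope_label: Optional[str] = None) -> str:
    """Pick a charset label without ever guessing.

    An explicit label attached to the transfer envelope wins; otherwise
    the protocol's documented default is used.  If neither is known,
    the function refuses rather than guessing.
    """
    label = envelope_label or PROTOCOL_DEFAULTS.get(protocol.lower())
    if label is None:
        raise ValueError("no charset label available for " + protocol)
    codecs.lookup(label)  # raises LookupError if the label is unknown
    return label


print(resolve_charset("http"))           # iso-8859-1
print(resolve_charset("smtp", "utf-8"))  # utf-8
```

Raising an error when no label is available is a deliberate design
choice in this sketch: it surfaces missing labels instead of hiding
them behind a plausible but unverified guess.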

393	3.3.1:  Recommendations for value specification mechanisms

395	While each of these techniques (with the  exception of guessing) is
396	useful in particular situations, interoperability requires a more
397	consistent set of techniques.  Thus, we recommend that MIME registered
398	values be used for all tagging of character sets and languages UNLESS
399	there is an existing mechanism for determining the required
400	information using one of the other techniques (except guessing).  This
401	recommendation will require a fair bit of work on the part of protocol
402	designers, implementors, the IETF, the IESG, and the IAB.

404	However, it is important to point out that the MIME concept of
405	'charset' in some cases cuts across several layers of components in
406	our model.  While this can be accepted in existing registrations, we
407	also recommend that the MIME registration procedure for character sets
408	be modified to show how a proposed character set deals with the CCS
409	and the CES.

411	There are a number of other recommendations, but these will be covered
412	in the next sections.

414	3.4:  Recommended Defaults

416	For a number of reasons, one cannot define a mandatory set of defaults
417	for all Internet protocols.  There is a mass of current practice,
418	future protocols are likely to have different purposes, which may
419	determine their handling of text, and protocols may need specific
420	variation support.  For example, in mail, text is a predominant data
421	type and coded character sets then become a major issue for the
422	protocol.  Also, since e-mail is ubiquitous and users expect to be
423	able to send it to everyone, the mail protocols need to be quite adept
424	at handling different character set encodings.  On the other hand, if
425	strings are seldom used in a given protocol, there is no need to weigh
426	the protocol down with a sophisticated apparatus for handling multiple
427	character sets, assuming that the predicated character set can handle
428	all the protocol's needs. This observation also applies to the
429	specification techniques for character set parameters.  If only one
430	character set encoding is needed, it can be made explicit in the
431	protocol specification.  Protocols with a greater need for character
432	set support will need a more elaborate specification technique.

434	3.4.1:  Clarity of specification

436	We recommend that each protocol clearly specify what it is using for
437	each of the layers of the transmission model.  Users (or clients)
438	should never have to guess what the parameter is for a given layer.

440	3.4.2:  Default Coded Character Set

442	The default Coded Character Set is the repertoire of ISO-10646.

444	3.4.3:  Default Character Encoding Scheme

446	For text-oriented protocols, new protocols should use UTF-8, and
447	protocols that have a backwards compatibility requirement should use
448	the default of the existing protocol, e.g. US-ASCII for mail, and
449	ISO-8859-1 for HTTP.  The recommended specification scheme is the MIME
450	"charset" specification, using the IANA "charset" specifications.  The
451	MIME specifications will need to be clarified to meet this model in
452	the future.

454	For other protocols, the default should be UTF-8 as this initially
455	allows US-ASCII to be entered as-is, and enables the full repertoire
456	of ISO 10646.

458	Some protocols, such as those descended from SGML [SGML], have other
459	natural notations for characters outside their "natural" repertoire;
460	for instance, HTML [HTML] allows the use of &#nnnn; to refer to any ISO
461	10646 character.  Note that this, like all other encodings that depend
462	on "escape characters", redefines at least one character from the base
463	character set for use as an indicator of "foreign" characters.  Use of
464	this approach must be weighed very carefully.
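
The cost of the escape-character approach can be seen in a small
Python sketch.  It assumes the html module and the "xmlcharrefreplace"
error handler from the Python standard library; the two function names
are hypothetical.

```python
import html


def to_ascii_with_references(text: str) -> str:
    """Express text in US-ASCII using numeric character references."""
    # "&" is redefined as the escape indicator, so literal ampersands
    # must themselves be escaped first.
    escaped = html.escape(text, quote=False)
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


def from_references(markup: str) -> str:
    """Resolve character references back to the original characters."""
    return html.unescape(markup)


markup = to_ascii_with_references("λ & μ")
print(markup)                   # &#955; &amp; &#956;
print(from_references(markup))  # λ & μ
```

The literal ampersand no longer stands for itself once it serves as
the indicator of "foreign" characters, which is exactly the
redefinition of a base character noted above.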

466	3.4.4:  Default Transfer Encoding Syntax

468	There is no recommended default for this level.  For plain text
469	oriented protocols, the bytestream transport format should be 8-bit
470	clean, possibly with normalization of end-of-line indicators.  Some
471	special cases could be made for protocols that are not 8-bit clean,
472	such as encoding the text for transport over 7-bit connections.  For binary
473	data the same recommendation holds.  The specification technique
474	should either be defined in the  protocol, if only one way is
475	permitted, or by use of MIME content-transfer-encoding (CTE)
476	techniques, using IANA registered values.

478	3.4.5:  Default Language

480	There is no recommended default for the language level.  For human
481	readable text, there should always be a way to specify the natural
482	language. The specification technique should be a MIME identifier with
483	IANA registered values for languages.  If headers are used, the
484	header should be 'Content-Language'.

486	3.4.6:  Default Locale

488	The default should be the POSIX locale.  The specification technique
489	should use the Cultural register of CEN ENV 12005 [CEN] for the values.
490	If headers are used, the header should be 'Content-Locale'.

492	3.4.7:  Default Culture

494	There is no recommended default for the Culture level.  The
495	specification technique should be a MIME or MIME-like identifier
496	(e.g. Content-Culture) and should use the Cultural register of CEN ENV
497	12005 for its values.

499	3.4.8:  Default Presentation

501	There is no recommended default for the Presentation level.  The
502	specification technique should be a MIME or MIME-like identifier (e.g.
503	Content-Layout) and use the glyph register of ISO 10036 and other
504	registers for its values.

506	3.4.9:  Multiplexing

508	In some cases, text transmission may require the use of a number of
509	different values for a given parameter; for example, English
510	annotation of Japanese text might well require shifting the
511	Content-Language parameter.  The way to switch the value of parameters
512	within a single body of text depends on the application.  For
513	instance, the HTML I18N [I18N] work defines a <SPAN LANG=xx> construct
514	for the purpose of switching between different languages.  When only one
515	value is needed, this value should be as general as possible, and
516	specified in the protocol standard with reference to the IANA or other
517	registry value.  All levels should be specified explicitly.

519	3.4.10:  Storage

521	Because stored text may very well be stored without any of the
522	additional information necessary for decoding, stored text SHOULD be
523	tagged in a MIME compliant fashion.  This alleviates the problem of
524	being unable to interpret text which has been stored for a long time,
525	or text whose provenance is not available.
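
One possible way to tag stored text in a MIME compliant fashion is
sketched below using the Python standard library email package.  The
function names tag_text and read_tagged, the sample text, and the
language tag "da" are assumptions made for illustration only.

```python
from email import message_from_bytes
from email.message import EmailMessage


def tag_text(text: str, charset: str, language: str) -> bytes:
    """Wrap text in MIME headers so it can be decoded later."""
    msg = EmailMessage()
    # Sets Content-Type (with charset) and Content-Transfer-Encoding.
    msg.set_content(text, charset=charset, cte="base64")
    msg["Content-Language"] = language
    return msg.as_bytes()


def read_tagged(stored: bytes):
    """Recover the text and its tags from the stored octets."""
    msg = message_from_bytes(stored)
    charset = msg.get_content_charset()
    language = msg["Content-Language"]
    text = msg.get_payload(decode=True).decode(charset)
    return text, charset, language


if __name__ == "__main__":
    stored = tag_text("Bl\u00e5b\u00e6r\n", "iso-8859-1", "da")
    print(read_tagged(stored))
```

Because the charset, transfer encoding, and language travel with the
octets, the stored object remains interpretable without any knowledge
of where it came from.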

527	3.5:  Guidelines for conversions between coded character sets

529	This section covers various algorithms to convert a source text S,
530	encoded in the coded character set CCS(S), to a target text T, encoded
531	in the coded character set CCS(T).

533	Rep(X) is the character repertoire of coded character set X, i.e. the
534	set of characters which can be represented with X.

536	3.5.1:  Exact conversion

538	When Rep(CCS(S)) and Rep(CCS(T)) are equal or Rep(CCS(S)) is a subset
539	of Rep(CCS(T)), exact conversion is possible; i.e. T is equal to S.
540	The octets just need to be remapped.  The algorithm for performing
541	this remapping is simple, if the IANA-registered definition tables for
542	CCS(S) and CCS(T) are available.
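
The remapping can be sketched as follows for single-octet coded
character sets.  As an assumption for the example, the codec tables
shipped with Python stand in for the IANA-registered definition tables,
and the function names are hypothetical.  The sketch refuses to convert
any character that is missing from the target repertoire, since the
conversion would then no longer be exact.

```python
def definition_table(charset: str) -> dict:
    """Map each octet of a single-octet CCS to its ISO 10646 code value."""
    table = {}
    for octet in range(256):
        try:
            table[octet] = ord(bytes([octet]).decode(charset))
        except UnicodeDecodeError:
            pass  # octet is unassigned in this coded character set
    return table


def exact_convert(source: bytes, ccs_s: str, ccs_t: str) -> bytes:
    """Remap octets from CCS(S) to CCS(T) via ISO 10646 code values."""
    to_ucs = definition_table(ccs_s)
    from_ucs = {ucs: octet for octet, ucs in definition_table(ccs_t).items()}
    out = bytearray()
    for octet in source:
        ucs = to_ucs[octet]
        if ucs not in from_ucs:
            raise ValueError("character U+%04X is not in Rep(CCS(T))" % ucs)
        out.append(from_ucs[ucs])
    return bytes(out)


if __name__ == "__main__":
    s = "Bl\u00e5b\u00e6r".encode("iso-8859-1")
    print(exact_convert(s, "iso-8859-1", "cp865"))
```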

544	3.5.2:  Approximate conversion

546	In all other cases, any conversion creates a text T which differs from
547	S.  There are different principles for how this inevitable difference
548	should be handled.  A choice between them should be made, depending on
549	the purpose and requirements of the conversion.  Where possible, the
550	client application should be given mechanisms to determine what has
551	been done to the text.

553	3.5.2.1:  Length-modifying conversion for human display

555	When the length of the target text T is allowed to differ from the
556	length of the source text S, one should use a conversion method in
557	which each source character is converted to one or several target
558	characters, using a best-resemblance criterion in the choice of the
559	target character(s).

561	Examples:
562	   LATIN CAPITAL LETTER [*] ->  AE
563	   COPYRIGHT SIGN       [*] -> (c)

565	3.5.2.2:  Length-preserving conversion for human display

567	Where the text T must be presented and the length of T cannot differ
568	from the length of S, one should use a conversion method where each
569	source character is converted to one target character, using some kind
570	of best-resemblance criterion in the choice of target character.

572	Examples:
573	  LATIN CAPITAL LETTER  [*] -> A
574	  COPYRIGHT SIGN        [*] -> C
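
The two display-oriented conversions above differ only in the
replacement table.  The sketch below assumes a US-ASCII target and
assumes that the non-ASCII letter in the examples is U+00C6; both
tables and the function name approximate are hypothetical.

```python
# Hypothetical best-resemblance tables for a US-ASCII target.
LENGTH_MODIFYING = {"\u00c6": "AE", "\u00a9": "(c)"}
LENGTH_PRESERVING = {"\u00c6": "A", "\u00a9": "C"}


def approximate(text: str, table: dict, fallback: str = "?") -> str:
    """Replace characters outside US-ASCII using the given table."""
    out = []
    for ch in text:
        if ord(ch) < 0x80:
            out.append(ch)  # already in the target repertoire
        else:
            out.append(table.get(ch, fallback))
    return "".join(out)


if __name__ == "__main__":
    source = "\u00c6sop \u00a9"
    print(approximate(source, LENGTH_MODIFYING))
    print(approximate(source, LENGTH_PRESERVING))
```

Neither result can be converted back to the source text, which is why
these methods are suited to human display only.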

576	3.5.2.3:  Conversion without data loss

578	Where the conversion of the text S into T must be completely
579	reversible, apply a Character Encoding Syntax or other reversible
580	transformation method.  This case is most frequently met in data
581	storage requirements.

583	Examples:
584	  LATIN CAPITAL LETTER [*] -> &AE
585	  COPYRIGHT SIGN       [*] -> &(C

587	An alternate method can be used if the size of Rep(CCS(T)) is greater
588	than or equal to the size of Rep(CCS(S)): for each character in
589	Rep(CCS(S)) which is not present in Rep(CCS(T)), define a mapping into
590	a character in Rep(CCS(T)) which is not present in Rep(CCS(S)).

592	Examples:
593	  LATIN CAPITAL LETTER  [*] -> CYRILLIC CAPITAL LETTER [*]
594	  COPYRIGHT SIGN  [*] -> PARTIAL DIFFERENTIAL SIGN [*]

596	Note that conversion without data loss requires redefining some member
597	of T to indicate "the introduction of character data outside T".  This
598	effectively adds another level of CES on top of CES(T).
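
A reversible conversion of the first kind can be sketched as follows.
The escape syntax used here ("&" followed by six hexadecimal digits) is
an assumption chosen for simplicity and is not the mnemonic syntax of
the examples above; the function names are hypothetical.  Note that the
introducer "&" must itself be escaped, which is the redefinition of a
member of the target repertoire described in the preceding paragraph.

```python
import re

INTRO = "&"  # target character redefined to introduce an escape

_ESCAPE = re.compile(re.escape(INTRO) + r"([0-9A-F]{6})")


def to_ascii_reversible(text: str) -> str:
    """Encode text into US-ASCII without losing information."""
    out = []
    for ch in text:
        if ch == INTRO or ord(ch) > 0x7F:
            out.append("%s%06X" % (INTRO, ord(ch)))
        else:
            out.append(ch)
    return "".join(out)


def from_ascii_reversible(text: str) -> str:
    """Invert to_ascii_reversible."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


if __name__ == "__main__":
    encoded = to_ascii_reversible("AT&T \u00c6 \u00a9")
    print(encoded)
    print(from_ascii_reversible(encoded) == "AT&T \u00c6 \u00a9")
```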

600	4: Presentation issues

602	There are a number of considerations to make in selecting the base
603	character set.  One such consideration is the protocol's convenience
604	to users with limited equipment (for example only ISO 8859-1 or a
605	keyboard without the ability to enter all the characters in ISO
606	10646).  Alternative representation should be considered for these
607	users, both for input and output.  Possible options for the
608	representation of characters that cannot be displayed include
609	transliteration (a la CEN/TC304 or ISO TC46/SC2), RFC 1345 [RFC-1345]
610	representative icons, or the WG2 short name (u+xxxx).

612	5: Open issues

614	In addition to the issues declared out of scope and enumerated in
615	section 2.1, the following issues are still open and will need to be
616	addressed in other forums.  These issues (language tags, public
617	identifiers such as URL names, and bi-directionality) are briefly
618	discussed below, as they repeatedly encroached on the discussion.

620	5.1:   Language tags

622	Although the workshop decided not to explicitly address the so-called
623	"CJK issue", a few members felt it was necessary to have some
624	mechanism to address the problem of correct Han character display in
625	ISO-10646 text, and that calling it a "font issue" would
626	not suffice.

628	The "CJK issue" refers to the extended discussion about "Han
629	unification", the use of a single ISO-10646 codepoint to represent
630	multiple national variants of a Chinese (Han) character.  ISO-10646
631	can map uniquely to any single CJK national character set, but in the
632	absence of additional information an application cannot display an
633	ISO-10646 text using the proper national variants for that text.

635	It was agreed that language tags would be sufficient to disambiguate
636	unified characters. There was not, in our opinion, a significant
637	technical difference between the use of different coded character sets
638	with overlapping codepoints, and a single coded character set with
639	language tags.  Either way, the application has sufficient information
640	to display the text properly.

642	It was observed that in contemporary usage of MIME charsets, the
643	language is implied as well as the coded character set and the
644	character encoding syntax.  We agreed that this is excessive
645	overloading of MIME charsets.

647	To specify the language used in a particular block of text, we
648	recommend that the MIME tag "Content-Language" be used.  There are a
649	number of questions about this approach that need to be worked out,
650	however:

652	-  Is Content-Language: actually suitable?
653	-  Is there an overload between this function and the other
654	     intended functions of Content-Language: as described in RFC
655	     1766?
656	-  What, precisely, does "Content-Language: zh-tw, ja, ko, zh-cn"
657	     mean in this context? We believe it means that, in drawing a
658	     Han character, the Taiwanese variant (presumably traditional
659	     Han) is preferred, followed by the Japanese, Korean, and
660	     mainland Chinese (presumably simplified Han) variants. It does
661	     *NOT* mean "mixed text containing Taiwanese, Japanese, Korean,
662	     and mainland Chinese text with all the national variants in
663	     each of these".

665	Mixed CJK text, that simultaneously displays different variants
666	occupying the same codepoint, requires language tags embedded in the
667	data.  Ohta and Handa propose in RFC 1554 [RFC-1554] a MIME charset
668	using ISO-2022 shifts between multiple coded character sets; in effect
669	this is an encoding that uses coded character sets for displaying the
670	appropriate glyphs.

672	There is some speculation that mixed CJK text is relatively
673	infrequent, and that it is therefore acceptable to require that such
674	text be represented using a rich text format that can support
675	language tags.  In other words, a simplifying assumption can be made
676	for TEXT/PLAIN in email using ISO-10646 that will not require
677	multiple display representations for the same codepoint.  A mechanism
678	such as RFC 1554 could address this need if it were important,
679	although arguably RFC 1554 should really be identified as
680	TEXT/ISO-2022.

682	Note again that we recommend that support for language tagging SHOULD
683	be built into new protocols, as this will become a critical component
684	of the automated indexing and retrieval in information applications of
685	the future.

687	5.2:   Public identifiers

689	There is a considerable demand from the user community for the ability
690	to use non-ASCII characters in URL names, IMAP mailbox names, file
691	names, and other public identifiers. This is still an open problem.

693	5.3:   Bi-directionality

695	It was realized that a consistent framework for bi-directional text
696	was needed but there was no attempt to work on it in this workshop.

698	6:  Security considerations

700	There are no security considerations associated with character sets.

702	7:  Conclusions

704	This paper provides a conceptual framework and a set of
705	recommendations which, if adopted, should provide a solid foundation
706	for interoperability on the Internet. There are, however, a number of
707	open issues which will need to be addressed to provide ever better use
708	of text on the Internet.

710	8:  Recommendations

712	8.1:  To the IAB

714	There were a number of recommendations to the IAB about making the
715	standards process more aware of the need for character set
716	interoperability, and about the framework itself.

718	A: The IAB should trigger the examination of all RFCs to determine the
719	way they handle character sets, and obsolete or annotate the RFCs
720	where necessary.

722	B: The IESG should trigger the recommendation of procedures to the RFC
723	editor to encourage RFCs to specify character set handling if they
724	specify the transmission of text.

726	C: The IAB should trigger the production of a perspectives document on
727	the character set work that has gone on in the past and relate it to
728	the current framework.

730	D: Full ISO 10646 has a sufficiently broad repertoire, and scope for
731	further extension, that it is sufficient for use in Internet Protocols
732	(without excluding the use of existing alternatives).  There is no
733	need for specific development of character set standards for the
734	Internet.

736	E: The IAB should encourage the IRTF to create a research group to
737	explore the open issues of character sets on the Internet. This group
738	should set its sights much higher than this workshop did.

740	F: The IANA (perhaps with the help of an IETF or IRTF group) should
741	develop procedures for the registration of new character sets for use
742	in the Internet.

744	G: Register UTF-8 as a Character Encoding Scheme for MIME.

746	H: The current use of the "x-*" format for distinguishing experimental
747	tags should be continued for private use among consenting parties. All
748	other namespaces should be allocated by IANA.

750	I: Application protocol RFCs SHOULD include a section on
751	"Multilingual Considerations".

753	J: Application protocol RFCs SHOULD indicate how to transfer 'on the
754	wire' all characters in the character sets they use. They SHOULD also
755	specify how to transfer other information that applications may need
756	to know about the data.

758	K: The IESG should trigger a set of extensions to RFC 1522 to allow
759	language tagging of the free text parts of message headers.

761	8.2:  For new Internet protocols

763	New protocols do not suffer from the need to be compatible with old
764	7-bit pipes.  New protocol specifications SHOULD use ISO 10646 as the
765	base charset unless there is an overriding need to use a different
766	base character set.

768	New protocols SHOULD use values from the IANA registries when
769	referring to parameter values.  The way these values are carried in
770	the protocols is protocol dependent; if the protocol uses RFC-822-like
771	headers, the header names already in use SHOULD be used.

773	For protocols with only a single choice for each component, the
774	protocol should use the most general specification and should be
775	specified with reference to the registered value in the protocol
776	standard.

778	Protocols SHOULD tag text streams with the language of the text.

780	8.3:  For the registration of new character sets

782	Ned Freed will be releasing a new MIME registration document in
783	conjunction with this paper.

785	8.3.1:   A definition table for a coded character set

787	A definition table for a coded character set A must give, for each
788	character C that is in the repertoire of A:

790	a) If C is present in ISO 10646, the code value (in hexadecimal form)
791	     for that character.

793	b) If C is not present in ISO 10646, but may be constructed using ISO
794	     10646 combining characters, the series of code values (in
795	     hexadecimal form) used to construct that character.

797	c) If C is neither present in ISO 10646 nor constructible as in b), a
798	     textual description of the character, and a reference to its origin.

800	8.3.2:   A definition of a character encoding scheme

802	A definition of a character encoding scheme consists of:

804	-  A description of an algorithm which transforms every possible
805	     sequence of octets to either a sequence of pairs <CCS, code
806	     value> or to the error state "illegal octet sequence"
807	-  Specifications, either by reference to CCS's registered by IANA or
808	   in text, of each CCS upon which this CES is based.
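
The algorithm required by the first item can be sketched for one
concrete case.  The example below assumes UTF-8 as the CES and ISO
10646 as the single CCS upon which it is based; the function name and
the CCS label string are hypothetical.

```python
def decode_ces_utf8(octets: bytes) -> list:
    """Turn an octet sequence into a list of <CCS, code value> pairs.

    Raises ValueError("illegal octet sequence") when the octets are not
    a valid UTF-8 sequence.
    """
    try:
        text = octets.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        raise ValueError("illegal octet sequence") from exc
    return [("ISO-10646", ord(ch)) for ch in text]


if __name__ == "__main__":
    for ccs, code in decode_ces_utf8("A\u00c6".encode("utf-8")):
        print(ccs, "%04X" % code)
```

A CES based on several coded character sets, such as an ISO 2022
scheme, would return pairs naming different CCSs as the shift state
changes.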

810	Appendix A:

812	A-1:  IETF Protocols

814	The following list describes how various existing protocols handle
815	multiple character set information.

817	Email

819	   SMTP
820	     See 8.2. ESMTP makes it easy to negotiate the use of alternate
821	     language and encoding if it is needed.
822	   Headers
823	     RFC 1522 forms an adequate framework for supporting text; UTF-8
824	     alone is not a possible solution, because the mail pathways are
825	     assumed to be 7-bit 'forever'. However, RFC 1522 should be extended
826	     to allow language tagging of the free text parts of message
827	     headers.
828	   Bodies
829	     Selection of charset parameters for Email text bodies is
830	     reasonably well covered by the charset= parameter on Text/* MIME
831	     types.  Language is defined by the Content-language header of RFC
832	     1766.  Other information will have to be added using body part
833	     headers; due to the way MIME differentiates between body part
834	     headers and message headers, these will all have to have names
835	     starting with Content- .
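
The header mechanism of RFC 1522 mentioned under "Headers" above can be
sketched with the Python standard library email.header module.  The
helper names and the sample text are assumptions made for the example.
The encoded form carries the charset but, as noted above, no language
tag.

```python
from email.header import Header, decode_header


def encode_header_text(text: str, charset: str) -> str:
    """Encode free text as 7-bit safe encoded-words."""
    return Header(text, charset).encode()


def decode_header_text(value: str) -> str:
    """Decode a header value that may contain encoded-words."""
    parts = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            parts.append(chunk.decode(charset or "us-ascii"))
        else:
            parts.append(chunk)
    return "".join(parts)


if __name__ == "__main__":
    encoded = encode_header_text("Bl\u00e5b\u00e6r", "iso-8859-1")
    print(encoded)
    print(decode_header_text(encoded))
```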

837	NetNews

839	   NNTP
840	     See 8.2. No strong tradition for negotiation of encoding in NNTP
841	     exists.
842	   NetNews Messages
843	     These should be able to leverage off the mechanisms defined for
844	     Email.  One difference is that nearly all NNTP channels are 8-bit
845	     clean; some NNTP newsgroups have a tradition of using 8-bit
846	     charsets in both headers and bodies. Defining a character set
847	     default on a per-newsgroup basis might be a suitable approach.

849	RTCP
850	     The identifiers carried as information about parties are already
851	     defined to be in UTF-8.

853	FTP
854	   Protocol
855	     See 8.2. The common use of welcome banners in the login response
856	     means that there might be strong reason here to allow client and
857	     server to negotiate a language different from the default for
858	     greetings and error messages. This should be a simple protocol
859	     extension.
860	   Filenames
861	     Many fileservers now have the capability of using non-ASCII
862	     characters in filenames, while the "dir" and "get" commands of
867	     FTP are defined in terms of US-ASCII only. One possible solution
868	     would be to define a "UTF-8" mode for the transfer of filenames
869	     and directory information; this would need to be a negotiated
870	     facility, with fallback to US-ASCII if not negotiated. The
871	     important point here is consistency between all implementations;
872	     a single charset is better here than the ability to handle
873	     multiple charsets.

875	World Wide Web
876	   HTTP
877	     See 8.2. The single-shot style of HTTP makes negotiation more
878	     complex than it would otherwise be.
879	   HTML
880	     Internationalization of HTML [I18N] seems fairly well covered in
881	     the current "I18N" document. It needs review to see if it needs
882	     more specific details in order to carry application information
883	     apart from the language.

885	URLs
886	     URLs are "input identifiers", and powerful arguments should be
887	     made if they are ever to be anything but US-ASCII.

889	IMAP
890	     IMAP's information objects are MIME Email objects, and therefore
891	     are able to use that standard's methods. However, IMAP folder
892	     names are local identifiers; there is strong reason to allow
893	     non-ASCII characters in these. A UTF-8 negotiation might be the
894	     most appropriate thing; however, UTF-8 is awkward to use.
895	     Unfortunately, UTF-7 isn't suitable because it conflicts with
896	     popular hierarchy delimiters. The most recent IMAP draft
897	     specification describes a modified UTF-7 which avoids this
898	     problem.
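
     A sketch of such a modified UTF-7 encoder follows.  This report
     does not give the rules; the sketch assumes the rules as they
     were later published in the IMAP4rev1 specification (RFC 2060):
     printable US-ASCII represents itself, "&" is written as "&-", and
     other characters are carried as UTF-16 in a base64 variant that
     uses "," in place of "/" and omits padding.  The function name is
     hypothetical.

```python
import base64


def encode_modified_utf7(name: str) -> str:
    """Encode a mailbox name using IMAP modified UTF-7 (assumed rules)."""
    out = []
    pending = []  # run of characters that need base64 encoding

    def flush():
        if pending:
            raw = "".join(pending).encode("utf-16-be")
            b64 = base64.b64encode(raw).decode("ascii")
            out.append("&" + b64.rstrip("=").replace("/", ",") + "-")
            pending.clear()

    for ch in name:
        if 0x20 <= ord(ch) <= 0x7E:
            flush()
            out.append("&-" if ch == "&" else ch)
        else:
            pending.append(ch)
    flush()
    return "".join(out)


if __name__ == "__main__":
    print(encode_modified_utf7("~peter/mail/\u53f0\u5317/\u65e5\u672c\u8a9e"))
```

     Because "/" never appears inside an encoded run, a server that
     uses "/" as its hierarchy delimiter can still split the name.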

900	DNS
901	     DNS names are the prime example of identifiers that need to stay
902	     in US-ASCII for global interoperability. However, some DNS
903	     information, in particular TXT records, may represent information
904	     (such as names) that is outside the ASCII range. A single
905	     solution is the best; problems resulting from UTF-8 should be
906	     investigated.

908	WHOIS++
909	     WHOIS++ version 1 is defined to use ISO 8859-1. The next version
910	     will use UTF-8. The currently designed changes will also allow the
911	     specification of individual attributes on attribute names; these
912	     will make the passing of application information about the values
913	     (such as language) easier. No immediate action seems necessary.

915	WHOIS
916	     This has been a stable protocol for so many years now that it
917	     seems unwise to suggest that it be modified. Furthermore,
918	     compatible extensions exist in RWHOIS and WHOIS++; modification
922	     should rather be made to these protocols than to the WHOIS
923	     protocol itself.

925	Telnet
926	     This is a prime example of a protocol where character set support
927	     is necessary and nonexistent. The current draft on character set
928	     negotiation in Telnet seems adequate to the task; the question of
929	     passing other application data that might be useful is still
930	     open.

932	A-2: Non-IETF protocols

934	For these protocols, the IETF does not have any power to change them.
935	However, the guidelines developed by the workshop may still be useful
936	as input to the further development of the protocols.

938	Gopher: Gopher, Gopher+

940	Prospero (Archie)

942	NFS:  Filesystem

944	CORBA, Finger, GEDI, IRC, ISO 10160/1, Kerberos, LPR, RSTAT, RWhois,
945	SGML, TFTP, X11, X.500, Z39.50

947	Draft RFC               Character Set Workshop Report    November 1996

949	Appendix B: Acronyms

951	ASCII       American National Standard Code for Information
952	              Interchange
953	CCS         Coded Character Set
954	CEN ENV     European Committee for Standardisation (CEN) European
955	              pre-standard (ENV)
956	CES         Character Encoding Scheme
957	CJK         Chinese Japanese Korean
958	CORBA       Common Object Request Broker Architecture
959	CTE         Content Transfer Encoding
960	DNS         Domain Name System
961	ESMTP       Extended SMTP
962	FTP         File Transfer Protocol
963	HTML        Hypertext Markup Language
964	I18N        Internationalization (or 18 characters between the first
965	              (I) and last (n) character)
966	IAB         Internet Architecture Board
967	IANA        Internet Assigned Numbers Authority
968	IESG        Internet Engineering Steering Group
969	IETF        Internet Engineering Task Force
970	IMAP        Internet Message Access Protocol
971	IRC         Internet Relay Chat
972	IRTF        Internet Research Task Force
973	ISI         Information Sciences Institute
974	ISO         International Organization for Standardization
975	MIME        Multipurpose Internet Mail Extensions
976	NFS         Network File System
977	NNTP        Network News Transfer Protocol
978	POSIX       Portable Operating System Interface
979	RFC         Request for Comments (Internet standards documents)
980	RPC         Remote Procedure Call
981	RSTAT       Remote Statistics
982	RTCP        Real-Time Transport Control Protocol
983	Rwhois      Referral Whois
984	SGML        Standard Generalized Mark-up Language
985	SMTP        Simple Mail Transfer Protocol
986	TES         Transfer Encoding Syntax
987	TFTP        Trivial File Transfer Protocol
988	URL         Uniform Resource Locator
989	UTF         UCS Transformation Format

993	Appendix C:  Glossary

995	Bi-directionality -  A property of some text in which runs written
996	     right-to-left and runs written left-to-right are mixed (e.g.
997	     Hebrew or Arabic text containing Latin-script words or numbers)

999	Character - A single graphic symbol represented by a sequence of one or
1000	     more bytes.

1002	Character Encoding Scheme - The mapping from a coded character set to
1003	     an encoding which may be more suitable for a specific purpose. For
1004	     example, UTF-8 is a character encoding scheme for ISO 10646.

1006	Character Set - An enumerated group of symbols (e.g., letters, numbers
1007	     or glyphs)

1009	Coded Character Set - A mapping from a set of integers to the
1010	     characters of a character set.

1012	Culture - Preferences in the display of text based on cultural norms,
1013	     such as spelling and word choice.

1015	Language - The words and combinations of words that constitute a system
1016	     of expression and communication among people with a shared
1017	     history or set of traditions.

1019	Layout - Information needed to display text to the user, similar to
1020	     the presentation layer in the ISO telecommunications model.

1022	Locale - The attributes of communication, such as language, character
1023	     set and cultural conventions.

1025	On-the-wire -  The data that actually gets put into packets for
1026	     transmission to other computers.

1028	Transfer Encoding Syntax -  The mapping from a coded character set
1029	     which has been encoded in a Character Encoding Scheme to an
1030	     encoding which may be more suitable for transmission using
1031	     specific protocols. For example, Base64 is a transfer encoding
1032	     syntax.

1036	Appendix D:  References

1038	[*]  Non-ASCII character

1040	[ASCII]  ANSI X3.4:1986, "Coded Character Sets - 7 Bit American
1041	     National Standard Code for Information Interchange (7-bit ASCII)"

1043	[Base64]  Borenstein, N., and N. Freed, "MIME (Multipurpose Internet
1044	     Mail Extensions) Part One: Mechanisms for Specifying and
1045	     Describing the Format of Internet Message Bodies", RFC 1521,
1045	     September 1993.

1047	[CEN]  see http://tobbi.iti.is/TC304/welcome.html for current status.

1049	[HTML]  Berners-Lee, T., and D. Connolly, "Hypertext Markup Language -
1050	     2.0", RFC 1866, November 1995.

1052	[HTTP]  Berners-Lee, T., Fielding, R., and H. Nielsen, "Hypertext
1053	     Transfer Protocol -- HTTP/1.0", RFC 1945, May 1996.

1055	[I18N]  Yergeau, F., et al., "Internationalization of the Hypertext
1056	     Markup Language", Internet draft, August 1996.

1058	[IANA]  Reynolds, J., and J. Postel,  "Assigned Numbers", RFC 1700,
1059	     ISI, October 1994.

1061	[ISO-2022]  ISO/IEC 2022:1994,  "Information technology -- Character
1062	     Code Structure and Extension Techniques",  JTC1/SC2.

1064	[ISO-7498]  ISO/IEC 7498-1:1994,  "Information technology - Open Systems
1065	     Interconnection - Basic Reference Model:  The Basic Model".

1067	[ISO-8859]  Information Processing -- 8-bit Single-Byte Coded Graphic
1068	     Character Sets -- Part 1: Latin Alphabet no. 1,
1069	     ISO 8859-1:1987(E). Part 2: Latin Alphabet no. 2, ISO
1070	     8859-2:1987(E). Part 3: Latin Alphabet no. 3, ISO 8859-3:1988(E).
1071	     Part 4: Latin Alphabet no. 4, ISO 8859-4:1988(E). Part 5:
1072	     Latin/Cyrillic Alphabet, ISO 8859-5:1988(E). Part 6:
1073	     Latin/Arabic Alphabet, ISO 8859-6:1987(E). Part 7: Latin/Greek
1074	     Alphabet, ISO 8859-7:1987(E). Part 8: Latin/Hebrew Alphabet, ISO
1075	     8859-8:1988(E). Part 9: Latin Alphabet no. 5, ISO 8859-9:1990(E).
1076	     Part 10: Latin Alphabet no. 6, ISO 8859-10:1992(E).

1078	[ISO-10646]  ISO/IEC 10646-1:1993(E),  "Information technology --
1079	     Universal Multiple-Octet Coded Character Set (UCS) -- Part 1:
1080	     Architecture and Basic Multilingual Plane".  JTC1/SC2, 1993.

1082	[MIME]  Borenstein, N., and N. Freed, "MIME (Multipurpose Internet
1083	     Mail Extensions) Part One: Mechanisms for Specifying and
1084	     Describing the Format of Internet Message Bodies", RFC 1521,
1085	     Bellcore, Innosoft, September 1993.

1087	[POSIX]  Institute of Electrical and Electronics Engineers.  "IEEE
1088	     standard interpretations for IEEE standard portable operating
1092	     systems interface for computer environments". IEEE Std
1093	     1003.1-1988/Int, 1992 edition.  Sponsor, Technical Committee on
1094	     Operating Systems of the IEEE Computer Society.  New York, NY:
1095	     Institute of Electrical and Electronics Engineers, 1992.

1097	RFC 1340  See [IANA]

1099	[RFC-1345]  Simonsen, K., "Character Mnemonics & Character Sets".
1100	     Rationel Almen Planlaegning, June 1992.

1102	[RFC-1554]  Ohta, M., and K. Handa,  "ISO-2022-JP-2: Multilingual
1103	     Extension of ISO-2022-JP",  Tokyo Institute of Technology, ETL,
1104	     December 1993.

1106	RFC 1642  See [UTF-7]

1108	[RFC 1766]  Alvestrand, H., "Tags for the Identification of Languages",
1109	     UNINETT, March 1995.

1111	[RFC 1958]  Carpenter, B. (ed.) "Architectural Principles of the
1112	     Internet", IAB, June 1996.

1114	[SGML] ISO 8879:1986 "Information Processing - Text and Office Systems
1115	     - Standard Generalized Markup Language (SGML)"

1117	[SMTP]  Postel, J., "Simple Mail Transfer Protocol", STD 10, RFC 821,
1118	     August 1982.

1120	[Unicode]  Unicode Consortium, "The Unicode Standard, Version 2.0",
1121	     Reading, Mass.: Addison-Wesley Developers Press, 1996.

1123	[UTF-7]  Goldsmith, D., and M. Davis, "UTF-7: A Mail Safe
1124	     Transformation Format of Unicode", RFC 1642, Taligent, Inc., July
1125	     1994.

1127	[UTF-8]  International Standards Organization, Joint Technical
1128	     Committee 1 (ISO/JTC1), "Amendment 2:1993, UCS Transformation
1129	     Format 8 (UTF-8)", in ISO/IEC 10646-1:1993 Information technology
1130	     - Universal Multiple-Octet Coded Character Set (UCS) -- Part 1:
1131	     Architecture and Basic Multilingual Plane.  JTC1/SC2, 1993.

1135	Appendix E:  Recommended reading

1137	Alvestrand, H.  "Tags for the Identification of Languages."  RFC 1766.
1138	     UNINETT, March 1995.

1140	Alvestrand, H.  "X.400 Use of Extended Character Sets" RFC 1502. SINTEF
1141	     DELAB, August 1993.

1143	Borenstein, N.  "Implications of MIME for Internet Mail Gateways."
1144	     RFC 1344.  Bellcore, June 1992.

1146	Borenstein, N. and N. Freed.  "MIME (Multipurpose Internet Mail
1147	     Extensions) Part One: Mechanisms for Specifying and Describing the
1148	     Format of Internet Message Bodies."  RFC 1521.  Bellcore and
1149	     Innosoft, September 1993.

1151	Chernov, A. "Registration of a Cyrillic Character Set." RFC 1489. RELCOM
1152	     Development Team, July 1993.

1154	Choi, U. and K. Chan.  "Korean Character Encoding for Internet
1155	     Messages."  RFC 1557.  KAIST, December 1993.

1157	Freed, N. and N. Borenstein.  "Multipurpose Internet Mail Extensions
1158	     (MIME) Part Two:  Media Types."  draft-ietf-822ext-mime-reg-02.txt.
1159	     July 1993.

1161	Goldsmith, D., and M. Davis.  "Transformation Format for Unicode."
1162	     RFC 1642.  Taligent, Inc., July 1994.

1164	Goldsmith, D., and M. Davis.  "Using Unicode with MIME."  RFC 1641.
1165	     Taligent, Inc., July 1994.

1167	Jerman-Blazic, B. "Character handling in computer communication" in
1168	     "user needs in information technology standards", Computer Weekly
1169	     Professional service, eds. C.D. Evans, B.L. Meed & R.S. Walker,
1170	     P.C. Butterworth Heineman, 1993, Oxford, Boston, p. 102-129.

1172	Jerman-Blazic, B. "Tool supporting the internationalization of the
1173	     generic network services", Computer Networks and ISDN Systems,
1174	     No. 27 (1994), p. 429-435.

1176	Jerman-Blazic, B., A. Gogala and D. Gabrijelcic, "Transparent language
1177	     processing: A solution for internationalization of Internet
1178	     services", The LISA Forum Newsletter, 5 (1996) p. 12-21

1180	Lee, F., "HZ - A Data Format for Exchanging Files of Arbitrarily Mixed
1181	     Chinese and ASCII Characters."  RFC 1843.  Stanford University,
1182	     August 1995.

1184	McCarthy, J.  "Arbitrary Character Sets."  RFC 373.  Stanford
1185	     University, July 1972.

1189	Moore, K. "MIME (Multipurpose Internet Mail Extensions) Part Two:
1190	     Message Header Extensions for Non-ASCII Text."  RFC 1522.
1191	     University of Tennessee, September 1993.

1193	Murai, J., M. Crispin and E. van der Poel. "Japanese Character Encoding
1194	     for Internet Messages." RFC 1468. Keio University & Panda
1195	     Programming, June 1993.

1197	Nussbacher, H.  "Handling of Bi-directional Texts in MIME."  RFC 1556.
1198	     Israeli Inter-University, December 1993.

1200	Nussbacher, H. and Y. Bourvine.  "Hebrew Character Encoding for Internet
1201	     Messages."  RFC 1555.  Israeli Inter-University and Hebrew
1202	     University, December 1993.

1204	Ohta, M.  "Character Sets ISO-10646 and ISO-10646-J-1."  RFC 1815.
1205	     Tokyo Institute of Technology, July 1995.

1207	Postel, J. and J. Reynolds.  "File Transfer Protocol (FTP)."  RFC 959.
1208	     ISI, October 1985.

1210	Postel, J. and J. Reynolds.  "Telnet Protocol Specification."  RFC 854.
1211	     ISI, May 1983.

1213	Reynolds, J., and J. Postel,  "Assigned Numbers", RFC 1700,
1214	     ISI, October 1994. p.100-117.

1216	Rose, M. "The Internet Message", Prentice Hall, 1992

1218	Simonsen, K. "Character Mnemonics & Character Sets." RFC 1345.  Rationel
1219	     Almen Planlaegning, June 1992.

1221	Unicode Consortium.  "The Unicode Standard, Version 2.0."  Reading,
1222	     Mass.: Addison-Wesley Developers Press, 1996.

1224	Wei, Y., et al.  "ASCII Printable Characters-Based Chinese Character
1225	     Encoding for Internet Messages."  RFC 1842.  AsiaInfo Services,
1226	     Inc., et al.  August 1995.

1228	Zhu, H., et al.  "Chinese Character Encoding for Internet Messages."
1229	     RFC 1922.  Tsinghua University, et al., March 1996.

1233	Appendix F: Workshop attendee list

1235	These people were participants on the workshop mailing list.
1236	An * indicates that the person attended the workshop in person.

1238	  Glenn Adams <glenn@spyglass.com>
1239	* Joan Aliprand <joan@unicode.org>
1240	* Harald Alvestrand <Harald.T.Alvestrand@uninett.no>
1241	* Ran Atkinson <ran@cisco.com>
1242	* Bert Bos <bert@w3.org>
1243	* Brian Carpenter <brian@dxcoms.cern.ch>
1244	* Mark Crispin <mrc@panda.com>
1245	  Makx Dekkers <dekkers@pica.nl>
1246	  Robert Elz <kre@munnari.oz.au>
1247	  Patrik Faltstrom <paf@paf.se>
1248	* Zhu Haifeng <zhf@net.tsinghua.edu.cn>
1249	  Kenichi Handa <handa@etl.go.jp>
1250	  Olle Jarnefors <ojarnef@admin.kth.se>
1251	  Borka Jerman-Blazic <borka@e5.ijs.si>
1252	  John Klensin <klensin@mail1.reston.mci.net>
1253	* Larry Masinter <masinter@parc.xerox.com>
1254	* Rick McGowan <Rick_McGowan@next.com>
1255	* Keith Moore <moore+charsets@cs.utk.edu>
1256	* Lisa Moore <lisam@vnet.ibm.com>
1257	  Ruth Moulton <ruth@muswell.demon.co.uk>
1258	* Cecilia Preston <cecilia@well.com>
1259	* Joyce Reynolds <jkrey@isi.edu>
1260	* Keld Simonsen <keld@dkuug.dk>
1261	* Gary Smith <Gary_Smith@oclc.org>
1262	* Peter Svanberg <psv@nada.kth.se>
1263	* Chris Weider <cweider@microsoft.com>

1267	Appendix G: Authors' addresses

1269	Chris Weider
1270	cweider@microsoft.com
1271	Microsoft Corp.
1272	1 Microsoft Way
1273	Redmond, WA 98052
1274	USA

1276	Cecilia Preston
1277	cecilia@well.com
1278	Preston & Lynch
1279	PO Box 8310
1280	Emeryville, CA 94662
1281	USA

1283	Keld Simonsen
1284	Keld@dkuug.dk
1285	DKUUG
1286	Freubjergvej 3
1287	DK-2100 København Ø
1288	Danmark

1290	Harald T. Alvestrand
1291	Harald.T.Alvestrand@uninett.no
1292	UNINETT
1293	P.O.Box 6883 Elgeseter
1294	N-7002 TRONDHEIM
1295	NORWAY

1297	Randall Atkinson
1298	rja@cisco.com
1299	cisco Systems
1300	170 West Tasman Drive
1301	San Jose, CA 95134-1706
1302	USA

1304	Mark Crispin
1305	mrc@cac.washington.edu
1306	Networks & Distributed Computing
1307	University of Washington
1308	4545 15th Avenue NE
1309	Seattle, WA  98105-4527