idnits 2.17.1 draft-weider-iab-char-wrkshop-00.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** Cannot find the required boilerplate sections (Copyright, IPR, etc.) in this document. Expected boilerplate is as follows today (2024-04-26) according to https://trustee.ietf.org/license-info : IETF Trust Legal Provisions of 28-dec-2009, Section 6.a: This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79. IETF Trust Legal Provisions of 28-dec-2009, Section 6.b(i), paragraph 2: Copyright (c) 2024 IETF Trust and the persons identified as the document authors. All rights reserved. IETF Trust Legal Provisions of 28-dec-2009, Section 6.b(i), paragraph 3: This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- ** Missing expiration date. The document expiration date should appear on the first and last page. ** The document seems to lack a 1id_guidelines paragraph about Internet-Drafts being working documents. ** The document seems to lack a 1id_guidelines paragraph about 6 months document validity. ** The document seems to lack a 1id_guidelines paragraph about the list of current Internet-Drafts. 
** The document seems to lack a 1id_guidelines paragraph about the list of Shadow Directories. == No 'Intended status' indicated for this document; assuming Proposed Standard == The page length should not exceed 58 lines per page, but there was 1 longer page, the longest (page 1) being 59 lines == It seems as if not all pages are separated by form feeds - found 0 form feeds but 27 pages Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack a Security Considerations section. ** The document seems to lack an IANA Considerations section. (See Section 2.2 of https://www.ietf.org/id-info/checklist for how to handle the case when there are no actions for IANA.) ** The document seems to lack an Authors' Addresses Section. ** The abstract seems to contain references ([ISO-8859], [ISO-10646], [RFC1958], [ISO-7498], [ASCII], [RFC-1766], [SMTP], [ISO-2022], [UTF-7], [MIME], [UTF-8], [POSIX], [HTTP], [HTML], [Base64], [IANA]), which it shouldn't. Please replace those with straight textual mentions of the documents in question. ** The document seems to lack a both a reference to RFC 2119 and the recommended RFC 2119 boilerplate, even if it appears to use RFC 2119 keywords. RFC 2119 keyword, line 178: '...otocol machinery SHOULD NOT be changed...' RFC 2119 keyword, line 205: '...UTF-7 [UTF-7] MUST be available....' RFC 2119 keyword, line 208: '...lications; protocols SHOULD attempt to...' RFC 2119 keyword, line 220: '...vital, and MUST be supported....' RFC 2119 keyword, line 522: '...ry for decoding, stored text SHOULD be...' (8 more instances...) Miscellaneous warnings: ---------------------------------------------------------------------------- == Line 81 has weird spacing: '...tion of this...' == Line 138 has weird spacing: '...certain types...' == Line 144 has weird spacing: '...llowing issue...' == Line 395 has weird spacing: '...ith the excep...' 
== Line 474 has weird spacing: '... in the proto...' == (12 more instances...) -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (15 October 1996) is 10055 days in the past. Is this intentional? -- Found something which looks like a code comment -- if you have code sections in the document, please surround them with '' and '' lines. Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) -- Missing reference section? 'MIME' on line 1082 looks like a reference -- Missing reference section? 'POSIX' on line 1087 looks like a reference -- Missing reference section? 'SMTP' on line 1117 looks like a reference -- Missing reference section? 'ASCII' on line 1040 looks like a reference -- Missing reference section? 'RFC 1958' on line 1111 looks like a reference -- Missing reference section? 'UTF-8' on line 1127 looks like a reference -- Missing reference section? 'UTF-7' on line 1123 looks like a reference -- Missing reference section? 'HTML' on line 1049 looks like a reference -- Missing reference section? 'ISO-7498' on line 1064 looks like a reference -- Missing reference section? 'RFC-1766' on line 310 looks like a reference -- Missing reference section? 'ISO-10646' on line 1078 looks like a reference -- Missing reference section? 'ISO-8859' on line 1067 looks like a reference -- Missing reference section? 
'ISO-2022' on line 1061 looks like a reference -- Missing reference section? 'Base64' on line 1043 looks like a reference -- Missing reference section? 'IANA' on line 1097 looks like a reference -- Missing reference section? 'HTTP' on line 1052 looks like a reference -- Missing reference section? 'SGML' on line 1114 looks like a reference -- Missing reference section? 'CEN' on line 1047 looks like a reference -- Missing reference section? 'RFC-1345' on line 609 looks like a reference -- Missing reference section? 'RFC-1554' on line 1102 looks like a reference -- Missing reference section? 'I18N' on line 1055 looks like a reference -- Missing reference section? 'RFC 1345' on line 1099 looks like a reference -- Missing reference section? 'RFC 1766' on line 1108 looks like a reference -- Missing reference section? 'Unicode' on line 1120 looks like a reference Summary: 11 errors (**), 0 flaws (~~), 9 warnings (==), 27 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 1 The Report of the IAB Character Set Workshop 2 held 29 February - 1 March, 1996 3 INTERNET-DRAFT - version 3.3 15 October 1996 4 5 Expire in six months 7 Chris Weider, Chair 8 Cecilia Preston, Preston & Lynch 9 Keld Simonsen, DKUUG 10 Harald Alvestrand, UNINETT 11 Ran Atkinson, Cisco Systems 12 Mark Crispin, University of Washington 13 Peter Svanberg, KTH 15 Acknowledgments 17 The authors would like to sincerely thank Information Science 18 Institute (ISI), and in particular Joyce Reynolds for graciously 19 hosting this event; Joe Kemp and Jeanine Yamazaki of ISI made sure the 20 facilities met our needs. We also wish to thank the Internet Society, 21 which underwrote travel for participants who might not otherwise have 22 been able to attend. 
Of course, we also wish to thank the many 23 experts who participated in the workshop and on the mailing list; a 24 complete list of these people can be found in Appendix D. Bunyip 25 Information Systems was kind enough to provide mailing list facilities 26 for this work. 28 Table of Contents 30 Abstract 31 0: Executive summary 32 1: Introduction 33 2: Character sets on the Internet -- the problem today 34 2.1: Character set handling in existing protocols 35 3: The model 36 3.1: Components of the model 37 3.2: Recommended defaults 38 3.3: Guidelines for conversions between coded character sets 39 4: Presentation issues 40 5: Open issues 41 6: Security considerations 42 7: Conclusions 43 8: Recommendations 44 8.1: To the IAB 45 8.2: For new Internet protocols 46 8.3: For registration of new character sets 48 Appendix A: List of protocols affected by character set issues 49 Appendix B: Acronyms 50 Appendix C: Glossary 51 Appendix D: References 52 Appendix E: Recommended reading 53 Appendix F: Workshop attendee list 54 Appendix G: Author's addresses 55 Abstract 57 This report details the conclusions of an IAB-sponsored invitational 58 workshop held 29 February - 1 March, 1996, to discuss the use of 59 character sets on the Internet. It motivates the need to have 60 character set handling in Internet protocols which transmit text, 61 provides a conceptual framework for specifying character sets, 62 recommends the use of MIME tagging for transmitted text, recommends a 63 default character set *without* stating that there is no need for 64 other character sets, and makes a series of recommendations to the 65 IAB, IANA, and the IESG for furthering the integration of the 66 character set framework into text transmission protocols. 68 0: Executive summary 70 The term 'Character Set' means many things to many people. Even the 71 MIME registry of character sets registers items that have great 72 differences in semantics and applicability. 
This workshop provides 73 guidance to the IAB and IETF about the use of character sets on the 74 Internet and provides a common framework for interoperability between 75 the many character sets in use there. 77 The framework consists of four components: an architectural model, which 78 specifies components necessary for on-the-wire transmission of text; 79 recommendations for tagging transmitted (and stored) text; recommended 80 defaults for each level of the model; and a set of recommendations to 81 the IAB, IANA, and the IESG for furthering the integration of this 82 framework into text transmission protocols. 84 The architectural model specifies 7 layers, of which only three are 85 required for on-the-wire transmission. The Coded Character Set is a 86 mapping from a set of abstract characters to a set of integers. The 87 Character Encoding Scheme is a mapping from a Coded Character Set (or 88 several) to a set of octets. The Transfer Encoding Syntax is a 89 transformation applied to data which has been encoded using a 90 Character Encoding Scheme to allow it to be transmitted. These layers 91 should be specified in a transmitted text stream by using the MIME 92 encoding mechanisms. 94 This report recommends the use of ISO 10646 as the default Coded 95 Character Set, and UTF-8 as the default Character Encoding Scheme in 96 the creation of new protocols or new versions of old protocols which 97 transmit text. These defaults do not deprecate the use of other 98 character sets when and where they are needed; they are simply 99 intended to provide guidance and a specification for interoperability. 101 1: Introduction 103 This is the report of an IAB-sponsored invitational workshop on the 104 use of Character Sets on the Internet, held 29 February - 1 March 1996 105 at Information Science Institute (ISI) in Marina del Rey, California. 106 In addition, this report covers the discussion on the mailing list up 107 to and slightly beyond the workshop itself. 
The goals of this 108 workshop were to provide guidance to the IAB and the IETF about the 109 use of character sets on the Internet and, if possible, a common 110 framework for interoperability between the many character sets in use 111 there. Both goals were achieved. 113 2: Character sets on the Internet - the problem 115 The term 'character set' is typically applied to the contents of a 116 wide variety of text transmission and display protocols used on the 117 Internet. Because the term is used to mean different things, 118 confusion has arisen. For example, the MIME registry of character 119 sets [MIME] contains items that may differ greatly in their 120 applicability and semantics in various Internet protocols. 122 In addition, there is a vast profusion of different text encoding 123 schemes in use on the Internet. This per se is not a problem; each 124 scheme has evolved to meet real needs. However, information 125 applications such as mail, directories, and the World Wide Web have 126 each developed different techniques for dealing with the growing number 127 of schemes. A robust information architecture for the Internet 128 requires as much interoperability between these techniques as possible. 130 2.1: Related topics deemed out of scope for this workshop 132 Successful display of plain text transmitted over the Internet requires 133 a lot of information about the text itself, such as the underlying 134 character set, language, and so forth. An additional set of formatting 135 information is needed if the receiving application wishes to use local 136 (cultural) conventions when it presents the data to the user. This 137 formatting information provides the data necessary to 138 format certain types of textual data (dates, times, numbers and 139 monetary notation) into a form which is familiar to the user. The POSIX 140 [POSIX] notion of locale encompasses language, coded character set and 141 cultural conventions. 
143 To avoid unfruitful discussion, and to make the best use of the time 144 available for the workshop, we declared the following issues out of 145 scope for the purposes of this workshop: 147 - glyphs 148 - sorting 149 - culture (e.g. do we present the American or British spelling?) 150 - user interface issues 151 - internal representation of textual data 152 - included characters (why aren't certain characters available in 153 any character set?) 154 - locale (in the POSIX sense) 155 - font registration 156 - semantics 157 - user input/output issues 158 - Han unification issues 159 There are some related issues which were included for discussion, most 160 importantly the 'locale' components necessary for transport and 161 identification of multilingual texts. 163 2.2: Character Set handling in existing protocols 165 One of the group's overriding concerns was that the framework 166 developed for character set handling not break existing protocols. 167 With that in mind, the way character sets are being used in existing 168 protocols was examined. See Appendix A for a list of those protocols 169 and some recommendations for change. 171 2.2.1: General comments 173 The problem areas here fall into three main categories: protocols, 174 identifiers, and data. 176 2.2.1.1: Protocols 178 The protocol machinery SHOULD NOT be changed; allowing, for instance, 179 SMTP [SMTP] to use both MAIL FROM and POST FRA is dangerous to the 180 protocols' stability. However, many protocols carry error messages 181 and other information that is intended for human consumption; it MIGHT 182 be an advantage to allow these to be localized into a specific 183 language and character set, rather than staying in English and 184 US-ASCII [ASCII]. If this is done, new extensions should follow the 185 framework outlined below. 187 2.2.1.2: Identifiers. 189 There is a strong statement of direction from the IAB, RFC 1958 190 [RFC 1958], which states: 192 4.3 Public (i.e. 
widely visible) names should be in case 193 independent ASCII. Specifically, this refers to DNS names, 194 and to protocol elements that are transmitted in text format. 195 ... 196 5.4 Designs should be fully international, with support for 197 localization (adaptation to local character sets). In 198 particular, there should be a uniform approach to character 199 set tagging for information content. 201 In protocols that up to now have used US-ASCII only, UTF-8 [UTF-8] 202 forms a simple upgrade path; however, its use should be negotiated 203 either by negotiating a protocol version or by negotiating charset 204 usage, and a fallback to a US-ASCII compatible representation such as 205 UTF-7 [UTF-7] MUST be available. 207 The need for passing application data such as language on individual 208 identifiers varies between applications; protocols SHOULD attempt to 209 evaluate this need when designing mechanisms. Applying the ASCII 210 requirement for identifiers that are only used in a local context 211 (such as private mailbox folder names) is both unrealistic and 212 unreasonable; in such cases, methods for consistency in the handling 213 of character sets should be considered. 215 2.2.1.3: Data 217 Data that require character set handling include, for example, text, 218 databases, and HTML [HTML] pages. In these, support for multiple 219 character sets and proper application information is absolutely 220 vital, and MUST be supported. 222 2.3: Architectural requirements 224 To address the issues enumerated for this work, first an architectural 225 model was created which establishes the components that are required 226 to fully specify the transmission of textual data. Many of these 227 components are already familiar to the users of encoding protocols 228 such as MIME. Not all of these are discussed in detail in this 229 report; we restrict ourselves primarily to those components which are 230 required to specify the 'on-the-wire' phase of text transmission. 
232 Mandating a single, all-encompassing character set would not fit well 233 with the IETF philosophy of planning for architectural diversity. So, 234 the best that can be done is to provide a common *framework* for 235 identifying and using the multitude of character sets available on the 236 Internet. It would be an advantage if the total number of Coded 237 Character Sets could be kept to a minimum. This framework should meet 238 the following requirements: 240 - it should not break existing protocols (because then the likelihood 241 of deployment is very small), 242 - it should allow the use of character sets currently used on the 243 Internet, and 244 - it should be relatively easy to build into new protocols. 246 3: Architectural model 248 The basic architectural model which guided our discussions is shown 249 below. A distinction was made between those segments which were 250 necessary to successfully transmit character set data on-the-wire and 251 those needed to present that data to a user in a comprehensible manner. 252 The discussions were primarily restricted to those segments of the model 253 which specify the 'on-the-wire' transmission of textual data. 255 User interface issues: these are briefly discussed in Section 3.1.1. 256 Layout 257 Culture 258 Locale 259 Language 260 On-the-wire: see section 3.2 for detailed discussion. 261 Transfer Encoding Syntax 262 Character Encoding Scheme 263 Coded Character Set 265 3.1: Segments defined 267 3.1.1: User interface 269 3.1.1.1: Layout 271 Layout includes the elements needed for displaying text to the user, 272 such as font selection, word-wrapping, etc. It is similar to the 273 'presentation' layer in the 7-layer ISO telecommunications model 274 [ISO-7498]. 276 3.1.1.2: Culture 278 Culture includes information about cultural preferences, which affect 279 spelling, word choice, and so forth. 
281 3.1.1.3: Locale 283 The locale component includes the information necessary to make choices 284 about text manipulation which will present the text to the user in an 285 expected format. This information may include the display of date, time 286 and monetary symbol preferences. Notice that locale modifications are 287 typically applied to a text stream before it is presented to the user, 288 although they also are used to specify input formats. 290 3.1.1.4: Language 292 This component specifies the language of the transmitted text. At 293 times and in specific cases, language information may be required to 294 achieve a particular level of quality for the purpose of displaying a 295 text stream. For example, UTF-8 encoded Han may require transmission 296 of a language tag to select the specific glyphs to be displayed at a 297 particular level of quality. 299 Note that information other than language may be used to achieve the 300 required level of quality in a display process. In particular, a font 301 tag is sufficient to produce identical results. However, the 302 association of a language with a specific block of text has usefulness 303 far beyond its use in display. In particular, as the amount of 304 information available in multiple languages on the World Wide Web 305 grows, it becomes critical to specify which language is in use in 306 particular documents, to assist automatic indexing and retrieval of 307 relevant documents. 309 The term 'language tag' should be reserved for the short identifier of 310 RFC 1766 [RFC-1766] that only serves to identify the language. While 311 there may be other text attributes intimately associated with the 312 language of the document, such as desired font or text direction, 313 these should be specified with other identifiers rather than 314 overloading the language tag. 
316 3.2: On the wire 318 There are three segments of the model which are required for 319 completely specifying the content of a transmitted text stream (with 320 the occasional exception of the Language component, mentioned above). 321 These components are: 323 1) Coded Character Set, 324 2) Character Encoding Scheme, and 325 3) Transfer Encoding Syntax. 327 Each of these abstract components must be explicitly specified by the 328 transmitter when the data is sent. There may be instances of an 329 implicit specification due to the protocol/standard being used (e.g. 330 ANSI/NISO Z39.50). Also, in MIME, the Coded Character Set and Character 331 Encoding Scheme are specified by the Charset parameter to the 332 Content-Type header field, and Transfer Encoding Syntax is specified by 333 the Content-Transfer-Encoding header field. 335 3.2.1: Coded Character Set 337 A Coded Character Set (CCS) is a mapping from a set of abstract 338 characters to a set of integers. Examples of coded character sets are 339 ISO 10646 [ISO-10646], US-ASCII [ASCII], and the ISO-8859 series 340 [ISO-8859]. 342 3.2.2: Character Encoding Scheme 344 A Character Encoding Scheme (CES) is a mapping from a Coded Character 345 Set or several coded character sets to a set of octets. Examples of 346 Character Encoding Schemes are ISO-2022 [ISO-2022] and UTF-8 [UTF-8]. 348 3.2.3: Transfer Encoding Syntax 350 It is frequently necessary to transform encoded text into a format 351 which is transmissible by specific protocols. The Transfer Encoding 352 Syntax (TES) is a transformation applied to character data 353 encoded using a CCS and possibly a CES to allow it to be transmitted. 354 Examples of Transfer Encoding Syntaxes are Base64 Encoding [Base64], 355 gzip encoding, and so forth. 
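The three on-the-wire layers can be illustrated with a short sketch. This is only an illustration under stated assumptions, not part of the workshop's output: it takes ISO 10646 code points as the CCS, UTF-8 as the CES, and Base64 as the TES, and the function names to_wire and from_wire are hypothetical.

```python
import base64


def to_wire(text):
    """Apply the three on-the-wire layers to a string, one after another."""
    # Coded Character Set: abstract character -> integer (ISO 10646 code point).
    code_points = [ord(ch) for ch in text]
    # Character Encoding Scheme: integers -> octets (UTF-8 assumed here).
    octets = "".join(chr(cp) for cp in code_points).encode("utf-8")
    # Transfer Encoding Syntax: octets -> a form safe for 7-bit transport.
    wire = base64.b64encode(octets)
    return code_points, octets, wire


def from_wire(wire):
    """Reverse the layers on the receiving side (TES first, then CES)."""
    return base64.b64decode(wire).decode("utf-8")


# LATIN CAPITAL LETTER A WITH RING ABOVE, written as an escape to keep
# this sketch itself in US-ASCII.
code_points, octets, wire = to_wire("\u00c5")
print(code_points)      # [197]
print(octets.hex())     # c385
print(wire)             # b'w4U='
print(from_wire(wire) == "\u00c5")  # True
```

The receiver can only reverse the layers because it knows which CES and TES were applied, which is why the labels discussed in the next section matter.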
357 3.3: Determining which values of CCS, CES, and TES are used 359 To completely specify which CCS, CES, and TES are used in a specific 360 text transmission, there needs to be a consistent set of labels for 361 specifying which CCS, CES, and TES are used. Once the appropriate 362 mechanisms have been selected, there are six techniques for attaching 363 these labels to the data. 365 The labels themselves are named and registered, either with IANA 366 [IANA] or with some other registry. Ideally, their definitions are 367 retrievable from some registration authority. 369 Labels may be determined in one of the following ways: 371 - Determined by guessing, where the receiver of the text has to 372 guess the values of the CCS, CES, and TES. For example: "I got 373 this from Sweden so it's probably ISO-8859-1." This is 374 obviously not a very foolproof way to decode text. 375 - Determined by the standard, where the protocol used to transmit 376 the data has made documented choices of CCS, CES, and TES in the 377 standard. Thus, the encodings used are known through the 378 access protocol, for example HTTP [HTTP] uses (but is not 379 limited to) ISO-8859-1, SMTP uses US-ASCII. 380 - Attached to the transfer envelope, where the descriptive labels are 381 attached to the wrapper placed around the text for transport. 382 MIME headers are a good example of this technique. 383 - Included in the data stream, where the data stream itself has 384 been encoded in such a way as to signal the character set used. 385 For example, ISO-2022 encodes the data with escape sequences to 386 provide information on the character subset currently being used. 387 - Agreed by prior bilateral agreement, where some out-of-band 388 negotiation has allowed the text transmitter and receiver to 389 determine the CCS, CES, and TES for the transmitted text. 390 - Agreed to by negotiation during some phase, typically initialization 391 of the protocol. 
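The "attached to the transfer envelope" technique can be sketched with MIME headers. The example below is a minimal illustration that assumes Python's standard email package as the parser; the sample message and the function name read_labelled_text are hypothetical. The charset parameter carries the CCS and CES together, and Content-Transfer-Encoding carries the TES.

```python
from email import message_from_string

# A hypothetical message: the body is "Malm" followed by
# LATIN SMALL LETTER O WITH DIAERESIS, in ISO-8859-1, Base64-encoded.
RAW_MESSAGE = "\n".join([
    'Content-Type: text/plain; charset="iso-8859-1"',
    "Content-Transfer-Encoding: base64",
    "",
    "TWFsbfY=",
    "",
])


def read_labelled_text(raw):
    """Return (charset label, TES label, decoded text) using envelope labels."""
    msg = message_from_string(raw)
    # CCS + CES label; fall back to US-ASCII when no charset is given.
    charset = msg.get_content_charset(failobj="us-ascii")
    # TES label; "7bit" is the MIME default when the header is absent.
    tes = msg.get("Content-Transfer-Encoding", "7bit").lower()
    # Undo the TES, then undo the CES/CCS with the labelled charset.
    octets = msg.get_payload(decode=True)
    return charset, tes, octets.decode(charset)


charset, tes, text = read_labelled_text(RAW_MESSAGE)
print(charset, tes, text == "Malm\u00f6")  # iso-8859-1 base64 True
```

No guessing is involved here: the receiver decodes using only the labels supplied by the sender.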
393 3.3.1: Recommendations for value specification mechanisms 395 While each of these techniques (with the exception of guessing) is 396 useful in particular situations, interoperability requires a more 397 consistent set of techniques. Thus, we recommend that MIME registered 398 values be used for all tagging of character sets and languages UNLESS 399 there is an existing mechanism for determining the required 400 information using one of the other techniques (except guessing). This 401 recommendation will require a fair bit of work on the part of protocol 402 designers, implementors, the IETF, the IESG, and the IAB. 404 However, it is important to point out that the MIME concept of 405 'charset' in some cases cuts across several layers of components in 406 our model. While this can be accepted in existing registrations, we 407 also recommend that the MIME registration procedure for character sets 408 be modified to show how a proposed character set deals with the CCS 409 and the CES. 411 There are a number of other recommendations, but these will be covered 412 in the next sections. 414 3.4: Recommended Defaults 416 For a number of reasons, one cannot define a mandatory set of defaults 417 for all Internet protocols. There is a mass of current practice, 418 future protocols are likely to have different purposes, which may 419 determine their handling of text, and protocols may need specific 420 variation support. For example, in mail, text is a predominant data 421 type and coded character sets then become a major issue for the 422 protocol. Also, since e-mail is ubiquitous and users expect to be 423 able to send it to everyone, the mail protocols need to be quite adept 424 at handling different character set encodings. 
On the other hand, if 425 strings are seldom used in a given protocol, there is no need to weigh 426 the protocol down with a sophisticated apparatus for handling multiple 427 character sets, assuming that the predicated character set can handle 428 all the protocol's needs. This observation also applies to the 429 specification techniques for character set parameters. If only one 430 character set encoding is needed, it can be made explicit in the 431 protocol specification. Protocols with a greater need for character 432 set support will need a more elaborate specification technique. 434 3.4.1: Clarity of specification 436 We recommend that each protocol clearly specify what it is using for 437 each of the layers of the transmission model. Users (or clients) 438 should never have to guess what the parameter is for a given layer. 440 3.4.2: Default Coded Character Set: 442 The default Coded Character Set is the repertoire of ISO-10646. 444 3.4.3: Default Character Encoding Scheme 446 For text-oriented protocols, new protocols should use UTF-8, and 447 protocols that have a backwards compatibility requirement should use 448 the default of the existing protocol, e.g. US-ASCII for mail, and 449 ISO-8859-1 for HTTP. The recommended specification scheme is the MIME 450 "charset" specification, using the IANA "charset" specifications. The 451 MIME specifications will need to be clarified to meet this model in 452 the future. 454 For other protocols, the default should be UTF-8 as this initially 455 allows US-ASCII to be entered as-is, and enables the full repertoire 456 of ISO 10646. 458 Some protocols, such as those descended from SGML [SGML], have other 459 natural notations for characters outside their "natural" repertoire; 460 for instance, HTML [HTML] allows the use of &#nnnn to refer to any ISO 461 10646 character. 
Note that this, like all other encodings that depend 462 on "escape characters", redefines at least one character from the base 463 character set for use as an indicator of "foreign" characters. Use of 464 this approach must be weighed very carefully. 466 3.4.4: Default Transport Encoding Scheme 468 There is no recommended default for this level. For plain text 469 oriented protocols, the bytestream transport format should be 8-bit 470 clean, possibly with normalization of end-of-line indicators. Some 471 special cases could be made for protocols that are not 8-bit clean, 472 such as encoding it for transport over 7-bit connections. For binary 473 the same recommendation holds as above. The specification technique 474 should either be defined in the protocol, if only one way is 475 permitted, or by use of MIME content-transfer-encoding (CTE) 476 techniques, using IANA registered values. 478 3.4.5: Default Language 480 There is no recommended default for the language level. For human 481 readable text, there should always be a way to specify the natural 482 language. The specification technique should be a MIME identifier with 483 IANA registered values for languages. If headers are used, the 484 header should be 'Content-Language' 486 3.4.6: Default Locale 488 The default should be the POSIX locale. The specification technique 489 should use the Cultural register of CEN ENV 12005 [CEN] for the values. 490 If headers are used, the header should be 'Content-Locale'. 492 3.4.7: Default Culture 494 There is no recommended default for the Culture level. The 495 specification technique should be a MIME or MIME-like identifier 496 (e.g. Content-Culture) and should use the Cultural register of CEN ENV 497 12005 for its values. 499 3.4.8: Default Presentation 501 There is no recommended default for the Presentation level. The 502 specification technique should be a MIME or MIME-like identifier (e.g. 
503 Content-Layout) and use the glyph register of ISO 10036 and other 504 registers for its values. 506 3.4.9: Multiplexing 508 In some cases, text transmission may require the use of a number of 509 different values for a given parameter; for example, English 510 annotation of Japanese text might well require shifting the 511 Content-Language parameter. The way to switch the value of parameters 512 within a single body of text depends on the application. For 513 instance, the HTML I18N [I18N] work defines a construct 514 for the purpose of switching between different languages. When only one 515 value is needed, this value should be as general as possible, and 516 specified in the protocol standard with reference to the IANA or other 517 registry value. All levels should be specified explicitly. 519 3.4.10: Storage 521 Because stored text may very well be stored without any of the 522 additional information necessary for decoding, stored text SHOULD be 523 tagged in a MIME compliant fashion. This alleviates the problem of 524 being unable to interpret text which has been stored for a long time, 525 or text whose provenance is not available. 527 3.5: Guidelines for conversions between coded character sets 529 This section covers various algorithms to convert a source text S, 530 encoded in the coded character set CCS(S), to a target text T, encoded 531 in the coded character set CCS(T). 533 Rep(X) is the character repertoire of coded character set X, i.e. the 534 set of characters which can be represented with X. 536 3.5.1: Exact conversion 538 When Rep(CCS(S)) and Rep(CCS(T)) are equal or Rep(CCS(S)) is a subset 539 of Rep(CCS(T)), exact conversion is possible; i.e. T is equal to S. 540 The octets just need to be remapped. The algorithm for performing 541 this remapping is simple, if the IANA-registered definition tables for 542 CCS(S) and CCS(T) are available. 
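The remapping for exact conversion might look like the sketch below. It is an illustration only and rests on several assumptions: Python's built-in codec tables stand in for the IANA-registered definition tables, both coded character sets are single-octet, and the helper names repertoire_table and exact_convert are hypothetical.

```python
def repertoire_table(codec):
    """Map octet -> abstract character for a single-octet coded character set."""
    table = {}
    for octet in range(256):
        try:
            table[octet] = bytes([octet]).decode(codec)
        except UnicodeDecodeError:
            pass  # this octet is not assigned in the coded character set
    return table


def exact_convert(source_octets, src_codec, dst_codec):
    """Remap octets from CCS(S) to CCS(T) when Rep(CCS(S)) is a subset of Rep(CCS(T))."""
    src = repertoire_table(src_codec)
    dst = repertoire_table(dst_codec)
    # Exact conversion is only defined when every source character
    # also exists in the target repertoire.
    if not set(src.values()) <= set(dst.values()):
        raise ValueError("Rep(CCS(S)) is not a subset of Rep(CCS(T))")
    reverse = {ch: octet for octet, ch in dst.items()}
    try:
        return bytes(reverse[src[octet]] for octet in source_octets)
    except KeyError as exc:
        raise ValueError("octet not assigned in CCS(S)") from exc


# US-ASCII to an EBCDIC code page: same characters, different octets.
converted = exact_convert(b"MAIL", "ascii", "cp037")
print(converted.hex())            # d4c1c9d3
print(converted.decode("cp037"))  # MAIL
```

In the opposite direction (cp037 to US-ASCII) the subset test fails and the sketch raises an error, which corresponds to the approximate-conversion cases of the next section.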

3.5.2: Approximate conversion

In all other cases, any conversion creates a text T which differs from
S. There are different principles for how this inevitable difference
should be handled. A choice between them should be made, depending on
the purpose and requirements of the conversion. Where possible, the
client application should be given mechanisms to determine what has
been done to the text.

3.5.2.1: Length-modifying conversion for human display

When the length of the target text T is allowed to differ from the
length of the source text S, one should use a conversion method in
which each source character is converted to one or several target
characters, using a best resemblance criterion in the choice of the
target character(s).

Examples:
   LATIN CAPITAL LETTER [*]  ->  AE
   COPYRIGHT SIGN [*]        ->  (c)

3.5.2.2: Length-preserving conversion for human display

Where the text T must be presented and the length of T cannot differ
from the length of S, one should use a conversion method where each
source character is converted to one target character, using some kind
of best resemblance criterion in the choice of target character.

Examples:
   LATIN CAPITAL LETTER [*]  ->  A
   COPYRIGHT SIGN [*]        ->  C

3.5.2.3: Conversion without data loss

Where the conversion of the text S into T must be completely
reversible, apply a Character Encoding Syntax or other reversible
transformation method. This case is most frequently met in data
storage requirements.

Examples:
   LATIN CAPITAL LETTER [*]  ->  &AE
   COPYRIGHT SIGN [*]        ->  &(C

An alternate method can be used if the size of Rep(CCS(T)) is at
least the size of Rep(CCS(S)): for each character in Rep(CCS(S)) which
is not present in Rep(CCS(T)), define a mapping into a character in
Rep(CCS(T)) which is not present in Rep(CCS(S)).

Examples:
   LATIN CAPITAL LETTER [*]  ->  CYRILLIC CAPITAL LETTER [*]
   COPYRIGHT SIGN [*]        ->  PARTIAL DIFFERENTIAL SIGN [*]

Note that conversion without data loss requires redefining some member
of T to indicate "the introduction of character data outside T". This
effectively adds another level of CES on top of CES(T).

4: Presentation issues

There are a number of considerations to make in selecting the base
character set. One such consideration is the protocol's convenience
to users with limited equipment (for example only ISO 8859-1 or a
keyboard without the ability to enter all the characters in ISO
10646). Alternative representation should be considered for these
users, both for input and output. Possible options for the
representation of characters that cannot be displayed include
transliteration (a la CEN/TC304 or ISO TC46/SC2), RFC 1345 [RFC-1345]
representative icons, or the WG2 short name (u+xxxx).

5: Open issues

In addition to the issues declared out of scope and enumerated in
section 2.1, the following issues are still open and will need to be
addressed in other forums. These issues (language tags, public
identifiers such as URL names, and bi-directionality) are briefly
discussed below, as they repeatedly encroached on the discussion.

5.1: Language tags

Although the workshop decided not to explicitly address the so-called
"CJK issue", a few members felt it was necessary to have some
mechanism to address the problem of correct Han character display in
the ISO-10646 issue, and that saying that it was a "font issue" would
not suffice.

The "CJK issue" refers to the extended discussion about "Han
unification", the use of a single ISO-10646 codepoint to represent
multiple national variants of a Chinese (Han) character.
ISO-10646
can map uniquely to any single CJK national character set, but in the
absence of additional information an application cannot display an
ISO-10646 text using the proper national variants for that text.

It was agreed that language tags would be sufficient to disambiguate
unified characters. There was not, in our opinion, a significant
technical difference between the use of different coded character sets
with overlapping codepoints, and a single coded character set with
language tags. Either way, the application has sufficient information
to display the text properly.

It was observed that in contemporary usage of MIME charsets, the
language is implied as well as the coded character set and the
character encoding syntax. We agreed that this is excessive
overloading of MIME charsets.

To specify the language used in a particular block of text, we
recommend that the MIME tag "Content-Language" be used. There are a
number of questions about this approach that need to be worked out,
however:

- Is Content-Language: actually suitable?
- Is there an overload between this function and the other
  intended functions of Content-Language: as described in RFC
  1766?
- What, precisely, does "Content-Language: zh-tw, ja, ko, zh-cn"
  mean in this context? We believe it means that, in drawing a
  Han character, the Taiwanese variant (presumably traditional
  Han) is preferred, followed by the Japanese, Korean, and
  mainland Chinese (presumably simplified Han) variants. It does
  *NOT* mean "mixed text containing Taiwanese, Japanese, Korean,
  and mainland Chinese text with all the national variants in
  each of these".

Mixed CJK text that simultaneously displays different variants
occupying the same codepoint requires language tags embedded in the
data.
Ohta and Handa propose in RFC 1554 [RFC-1554] a MIME charset
using ISO-2022 shifts between multiple coded character sets; in effect
this is an encoding that uses coded character sets for displaying the
appropriate glyphs.

There is some speculation that mixed CJK text is relatively
infrequent, and that it is therefore acceptable to require that such
text be represented using a rich text format that can support language
tags. In other words, a simplifying assumption can be made for
TEXT/PLAIN in email using ISO-10646 that will not require multiple
display representations for the same codepoint. A mechanism such as
RFC 1554 could address this need if it were important, although
arguably RFC 1554 should really be identified as TEXT/ISO-2022.

Note again that we recommend that support for language tagging SHOULD
be built into new protocols, as this will become a critical component
of the automated indexing and retrieval in information applications of
the future.

5.2: Public identifiers

There is a considerable demand from the user community for the ability
to use non-ASCII characters in URL names, IMAP mailbox names, file
names, and other public identifiers. This is still an open problem.

5.3: Bi-directionality

It was realized that a consistent framework for bi-directional text
was needed but there was no attempt to work on it in this workshop.

6: Security considerations

There are no security considerations associated with character sets.

7: Conclusions

This paper provides a conceptual framework and a set of
recommendations which, if adopted, should provide a solid foundation
for interoperability on the Internet. There are, however, a number of
open issues which will need to be addressed to provide ever better use
of text on the Internet.

8: Recommendations

8.1: To the IAB

There were a number of recommendations to the IAB about making the
standards process more aware of the need for character set
interoperability, and about the framework itself.

A: The IAB should trigger the examination of all RFCs to determine the
way they handle character sets, and obsolete or annotate the RFCs
where necessary.

B: The IESG should trigger the recommendation of procedures to the RFC
editor to encourage RFCs to specify character set handling if they
specify the transmission of text.

C: The IAB should trigger the production of a perspectives document on
the character set work that has gone on in the past and relate it to
the current framework.

D: Full ISO 10646 has a sufficiently broad repertoire, and scope for
further extension, that it is sufficient for use in Internet Protocols
(without excluding the use of existing alternatives). There is no
need for specific development of character set standards for the
Internet.

E: The IAB should encourage the IRTF to create a research group to
explore the open issues of character sets on the Internet. This group
should set its sights much higher than this workshop did.

F: The IANA (perhaps with the help of an IETF or IRTF group) should
develop procedures for the registration of new character sets for use
in the Internet.

G: Register UTF-8 as a Character Encoding Scheme for MIME.

H: The current use of the "x-*" format for distinguishing experimental
tags should be continued for private use among consenting parties. All
other namespaces should be allocated by IANA.

I: Application protocol RFCs SHOULD include a section on
"Multilingual Considerations".

J: Application protocol RFCs SHOULD indicate how to transfer 'on the
wire' all characters in the character sets they use.
They SHOULD also
specify how to transfer other information that applications may need
to know about the data.

K: The IESG should trigger a set of extensions to RFC 1522 to allow
language tagging of the free text parts of message headers.

8.2: For new Internet protocols

New protocols do not suffer from the need to be compatible with old
7-bit pipes. New protocol specifications SHOULD use ISO 10646 as the
base charset unless there is an overriding need to use a different
base character set.

New protocols SHOULD use values from the IANA registries when
referring to parameter values. The way these values are carried in
the protocols is protocol dependent; if the protocol uses RFC-822-like
headers, the header names already in use SHOULD be used.

For protocols with only a single choice for each component, the
protocol should use the most general specification and should be
specified with reference to the registered value in the protocol
standard.

Protocols SHOULD tag text streams with the language of the text.

8.3: For the registration of new character sets

Ned Freed will be releasing a new MIME registration document in
conjunction with this paper.

8.3.1: A definition table for a coded character set

A definition table for a coded character set A must, for each
character C in the repertoire of A, give:

a) If C is present in ISO 10646, the code value (in hexadecimal form)
   for that character.

b) If C is not present in ISO 10646, but may be constructed using ISO
   10646 combining characters, the series of code values (in
   hexadecimal form) used to construct that character.

c) If C is not present in ISO 10646, a textual description of the
   character, and a reference to its origin.
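
One possible in-memory shape for such a definition table is sketched
below. It is an illustration only: the class TableEntry, its field
names, the octet positions, and the sample entries are all
hypothetical and are not taken from any registered character set. An
entry with one code value corresponds to case a), an entry with
several code values to case b), and an entry with a description and an
origin to case c).

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class TableEntry:
    """One row of a definition table (hypothetical layout)."""
    code_values: Tuple[int, ...] = ()   # ISO 10646 values, cases a) and b)
    description: Optional[str] = None   # textual description, case c)
    origin: Optional[str] = None        # reference to origin, case c)

    def case(self) -> str:
        """Classify the entry as case a), b) or c) of section 8.3.1."""
        if len(self.code_values) == 1:
            return "a"
        if len(self.code_values) > 1:
            return "b"
        if self.description and self.origin:
            return "c"
        raise ValueError("entry matches none of cases a), b), c)")

    def render(self) -> str:
        """Give code values in hexadecimal form, else the description."""
        if self.code_values:
            return " ".join(f"{v:04X}" for v in self.code_values)
        return f"{self.description} (origin: {self.origin})"


# Illustrative table for an imaginary single-octet character set.
TABLE: Dict[int, TableEntry] = {
    0x41: TableEntry(code_values=(0x0041,)),          # single code value
    0x80: TableEntry(code_values=(0x0071, 0x0303)),   # base + combining
    0x81: TableEntry(description="vendor logo",
                     origin="hypothetical vendor code page"),
}

for octet, entry in sorted(TABLE.items()):
    print(f"{octet:02X}  {entry.case()}  {entry.render()}")
```

Keeping the three cases in one record type means a registry check can
reject an entry that supplies neither code values nor a description
with its origin.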

8.3.2: A definition of a character encoding scheme

A definition of a character encoding scheme consists of:

- A description of an algorithm which transforms every possible
  sequence of octets either to a sequence of pairs or to the error
  state "illegal octet sequence".

- Specifications, either by reference to CCS's registered by IANA or
  in text, of each CCS upon which this CES is based.

Appendix A:

A-1: IETF Protocols

The following list describes how various existing protocols handle
multiple character set information.

Email

   SMTP
      See 8.2. ESMTP makes it easy to negotiate the use of alternate
      language and encoding if it is needed.
   Headers
      RFC 1522 forms an adequate framework for supporting text; UTF-8
      alone is not a possible solution, because the mail pathways are
      assumed to be 7-bit 'forever'. However, RFC 1522 should be
      extended to allow language tagging of the free text parts of
      message headers.
   Bodies
      Selection of charset parameters for Email text bodies is
      reasonably well covered by the charset= parameter on Text/* MIME
      types. Language is defined by the Content-Language header of RFC
      1766. Other information will have to be added using body part
      headers; due to the way MIME differentiates between body part
      headers and message headers, these will all have to have names
      starting with Content- .

NetNews

   NNTP
      See 8.2. No strong tradition for negotiation of encoding in NNTP
      exists.
   NetNews Messages
      These should be able to leverage off the mechanisms defined for
      Email. One difference is that nearly all NNTP channels are 8-bit
      clean; some NNTP newsgroups have a tradition of using 8-bit
      charsets in both headers and bodies. Defining a character set
      default on a per-newsgroup basis might be a suitable approach.

RTCP
   The identifiers carried as information about parties are already
   defined to be in UTF-8.

FTP
   Protocol
      See 8.2. The common use of welcome banners in the login response
      means that there might be strong reason here to allow client and
      server to negotiate a language different from the default for
      greetings and error messages. This should be a simple protocol
      extension.
   Filenames
      Many fileservers now have the capability of using non-ASCII
      characters in filenames, while the "dir" and "get" commands of
      FTP are defined in terms of US-ASCII only. One possible solution
      would be to define a "UTF-8" mode for the transfer of filenames
      and directory information; this would need to be a negotiated
      facility, with fallback to US-ASCII if not negotiated. The
      important point here is consistency between all implementations;
      a single charset is better here than the ability to handle
      multiple charsets.

Draft RFC          Character Set Workshop Report          November 1996

World Wide Web
   HTTP
      See 8.2. The single-shot style of HTTP makes negotiation more
      complex than it would otherwise be.
   HTML
      Internationalization of HTML [I18N] seems fairly well covered in
      the current "I18N" document. It needs review to see if it needs
      more specific details in order to carry application information
      apart from the language.

URLs
   URLs are "input identifiers", and powerful arguments should be
   made if they are ever to be anything but US-ASCII.

IMAP
   IMAP's information objects are MIME Email objects, and therefore
   are able to use that standard's methods. However, IMAP folder
   names are local identifiers; there is strong reason to allow
   non-ASCII characters in these. A UTF-8 negotiation might be the
   most appropriate thing; however, UTF-8 is awkward to use.
   Unfortunately, UTF-7 isn't suitable because it conflicts with
   popular hierarchy delimiters. The most recent IMAP draft
   specification describes a modified UTF-7 which avoids this
   problem.

DNS
   DNS names are the prime example of identifiers that need to stay
   in US-ASCII for global interoperability. However, some DNS
   information, in particular TXT records, may represent information
   (such as names) that is outside the ASCII range. A single
   solution is the best; problems resulting from UTF-8 should be
   investigated.

WHOIS++
   WHOIS++ version 1 is defined to use ISO 8859-1. The next version
   will use UTF-8. The currently designed changes will also allow the
   specification of individual attributes on attribute names; these
   will make the passing of application information about the values
   (such as language) easier. No immediate action seems necessary.

WHOIS
   This has been a stable protocol for so many years now that it
   seems unwise to suggest that it be modified. Furthermore,
   compatible extensions exist in RWHOIS and WHOIS++; modification
   should rather be made to these protocols than to the WHOIS
   protocol itself.

Telnet
   This is a prime example of a protocol where character set support
   is necessary and nonexistent. The current draft on character set
   negotiation in Telnet seems adequate to the task; the question of
   passing other application data that might be useful is still
   open.

A-2: Non-IETF protocols

For these protocols, the IETF does not have any power to change them.
However, the guidelines developed by the workshop may still be useful
as input to the further development of the protocols.

Gopher: Gopher, Gopher+

Prospero (Archie)

NFS: Filesystem

CORBA, Finger, GEDI, IRC, ISO 10160/1, Kerberos, LPR, RSTAT, RWhois,
SGML, TFTP, X11, X.500, Z39.50

Appendix B: Acronyms

ASCII    American Standard Code for Information Interchange
CCS      Coded Character Set
CEN ENV  European Committee for Standardisation (CEN) European
         pre-standard (ENV)
CES      Character Encoding Scheme
CJK      Chinese Japanese Korean
CORBA    Common Object Request Broker Architecture
CTE      Content Transfer Encoding
DNS      Domain Name System
ESMTP    Extended SMTP
FTP      File Transfer Protocol
HTML     Hypertext Markup Language
HTTP     Hypertext Transfer Protocol
I18N     Internationalization (or 18 characters between the first
         (I) and last (n) character)
IAB      Internet Architecture Board
IANA     Internet Assigned Numbers Authority
IESG     Internet Engineering Steering Group
IETF     Internet Engineering Task Force
IMAP     Internet Message Access Protocol
IRC      Internet Relay Chat
IRTF     Internet Research Task Force
ISI      Information Sciences Institute
ISO      International Organization for Standardization
MIME     Multipurpose Internet Mail Extensions
NFS      Network File System
NNTP     Network News Transfer Protocol
POSIX    Portable Operating System Interface
RFC      Request for Comments (Internet standards documents)
RPC      Remote Procedure Call
RSTAT    Remote Statistics
RTCP     Real-Time Transport Control Protocol
Rwhois   Referral Whois
SGML     Standard Generalized Mark-up Language
SMTP     Simple Mail Transfer Protocol
TES      Transfer Encoding Syntax
TFTP     Trivial File Transfer Protocol
URL      Uniform Resource Locator
UTF      UCS Transformation Format

Appendix C: Glossary

Bi-directionality - A property of some languages in which written
   text alternates direction from line to line (e.g.
   right-to-left
   one line, left-to-right the next)

Character - A single graphic symbol represented by a sequence of one
   or more bytes.

Character Encoding Scheme - The mapping from a coded character set to
   an encoding which may be more suitable for a specific purpose. For
   example, UTF-8 is a character encoding scheme for ISO 10646.

Character Set - An enumerated group of symbols (e.g., letters, numbers
   or glyphs)

Coded Character Set - The mapping from a set of integers to a
   character from a character set.

Culture - Preferences in the display of text based on cultural norms,
   such as spelling and word choice.

Language - The words and combinations of words that constitute a
   system of expression and communication among people with a shared
   history or set of traditions.

Layout - Information needed to display text to the user, similar to
   the presentation layer in the ISO telecommunications model.

Locale - The attributes of communication, such as language, character
   set and cultural conventions.

On-the-wire - The data that actually gets put into packets for
   transmission to other computers.

Transfer Encoding Syntax - The mapping from a coded character set
   which has been encoded in a Character Encoding Scheme to an
   encoding which may be more suitable for transmission using
   specific protocols. For example, Base64 is a transfer encoding
   syntax.

Appendix D: References

[*] Non-ASCII character

[ASCII] ANSI X3.4:1986 "Coded Character Sets - 7 Bit American
   National Standard Code for Information Interchange (7-bit ASCII)"

[Base64] Borenstein, N., and N. Freed, "MIME (Multipurpose Internet
   Mail Extensions) Part One: Mechanisms for Specifying and Describing
   the Format of Internet Message Bodies", RFC 1521, September 1993.

[CEN] see http://tobbi.iti.is/TC304/welcome.html for current status.

[HTML] Berners-Lee, T., and D. Connolly, "Hypertext Markup Language -
   2.0", RFC 1866, November 1995.

[HTTP] Berners-Lee, T., R. Fielding, and H. Nielsen, "Hypertext
   Transfer Protocol -- HTTP/1.0", RFC 1945, May 1996.

[I18N] Yergeau, F., et al., "Internationalization of the Hypertext
   Markup Language", Internet draft, August 1996.

[IANA] Reynolds, J., and J. Postel, "Assigned Numbers", RFC 1700,
   ISI, October 1994.

[ISO-2022] ISO/IEC 2022:1994, "Information technology -- Character
   Code Structure and Extension Techniques", JTC1/SC2.

[ISO-7498] ISO/IEC 7498-1:1994, "Information technology - Open Systems
   Interconnection - Basic Reference Model: The Basic Model".

[ISO-8859] Information Processing -- 8-bit Single-Byte Coded Graphic
   Character Sets -- Part 1: Latin Alphabet no. 1,
   ISO 8859-1:1987(E). Part 2: Latin Alphabet no. 2,
   ISO 8859-2:1987(E). Part 3: Latin Alphabet no. 3,
   ISO 8859-3:1988(E). Part 4: Latin Alphabet no. 4,
   ISO 8859-4:1988(E). Part 5: Latin/Cyrillic Alphabet,
   ISO 8859-5:1988(E). Part 6: Latin/Arabic Alphabet,
   ISO 8859-6:1987(E). Part 7: Latin/Greek Alphabet,
   ISO 8859-7:1987(E). Part 8: Latin/Hebrew Alphabet,
   ISO 8859-8:1988(E). Part 9: Latin Alphabet no. 5,
   ISO 8859-9:1990(E). Part 10: Latin Alphabet no. 6,
   ISO 8859-10:1992(E).

[ISO-10646] ISO/IEC 10646-1:1993(E), "Information technology --
   Universal Multiple-Octet Coded Character Set (UCS) -- Part 1:
   Architecture and Basic Multilingual Plane". JTC1/SC2, 1993.

[MIME] Borenstein, N., and N. Freed, "MIME (Multipurpose Internet
   Mail Extensions) Part One: Mechanisms for Specifying and
   Describing the Format of Internet Message Bodies", RFC 1521,
   Bellcore, Innosoft, September 1993.

[POSIX] Institute of Electrical and Electronics Engineers. "IEEE
   standard interpretations for IEEE standard portable operating
   systems interface for computer environments". IEEE Std 1003.1
   -1988/Int, 1992 edition. Sponsor, Technical Committee on Operating
   Systems of the IEEE Computer Society. New York, NY: Institute of
   Electrical and Electronic Engineers, 1992.

RFC 1340 See [IANA]

[RFC-1345] Simonsen, K., "Character Mnemonics & Character Sets",
   RFC 1345, Rationel Almen Planlaegning, June 1992.

[RFC-1554] Ohta, M., and K. Handa, "ISO-2022-JP-2: Multilingual
   Extension of ISO-2022-JP", RFC 1554, Tokyo Institute of
   Technology, ETL, December 1993.

RFC 1642 See [UTF-7]

[RFC-1766] Alvestrand, H., "Tags for the Identification of Languages",
   RFC 1766, UNINETT, March 1995.

[RFC-1958] Carpenter, B. (ed.), "Architectural Principles of the
   Internet", RFC 1958, IAB, June 1996.

[SGML] ISO 8879:1986 "Information Processing - Text and Office Systems
   - Standard Generalized Markup Language (SGML)"

[SMTP] Postel, J., "Simple Mail Transfer Protocol", RFC 821, STD 10,
   August 1982.

[Unicode] Unicode Consortium, "The Unicode Standard, Version 2.0".
   Reading, Mass.: Addison-Wesley Developers Press, 1996.

[UTF-7] Goldsmith, D., and M. Davis, "UTF-7: A Mail Safe
   Transformation Format of Unicode", RFC 1642, Taligent, Inc., July
   1994.

[UTF-8] International Standards Organization, Joint Technical
   Committee 1 (ISO/JTC1), "Amendment 2:1993, UCS Transformation
   Format 8 (UTF-8)", in ISO/IEC 10646-1:1993 Information technology
   - Universal Multiple-Octet Coded Character Set (UCS) -- Part 1:
   Architecture and Basic Multilingual Plane. JTC1/SC2, 1993.

Appendix E: Recommended reading

Alvestrand, H. "Tags for the Identification of Languages." RFC 1766.
   UNINETT, March 1995.

Alvestrand, H. "X.400 Use of Extended Character Sets." RFC 1502.
   SINTEF DELAB, August 1993.

Borenstein, N. "Implications of MIME for Internet Mail Gateways."
   RFC 1344. Bellcore, June 1992.

Borenstein, N. and N. Freed. "MIME (Multipurpose Internet Mail
   Extensions) Part One: Mechanisms for Specifying and Describing the
   Format of Internet Message Bodies." RFC 1521. Bellcore and
   Innosoft, September 1993.

Chernov, A. "Registration of a Cyrillic Character Set." RFC 1489.
   RELCOM Development Team, July 1993.

Choi, U. and K. Chan. "Korean Character Encoding for Internet
   Messages." RFC 1557. KAIST, December 1993.

Freed, N. and N. Borenstein. "Multipurpose Internet Mail Extensions
   (MIME) Part Two: Media Types." draft-ietf-822ext-mime-reg-02.txt.
   July 1993.

Goldsmith, D., and M. Davis. "UTF-7: A Mail Safe Transformation
   Format of Unicode." RFC 1642. Taligent, Inc., July 1994.

Goldsmith, D., and M. Davis. "Using Unicode with MIME." RFC 1641.
   Taligent, Inc., July 1994.

Jerman-Blazic, B. "Character handling in computer communication" in
   "User needs in information technology standards", Computer Weekly
   Professional service, eds. C.D. Evans, B.L. Meed & R.S. Walker,
   P.C. Butterworth-Heinemann, 1993, Oxford, Boston, p. 102-129.

Jerman-Blazic, B. "Tool supporting the internationalization of the
   generic network services", Computer Networks and ISDN Systems,
   No. 27 (1994), p. 429-435.

Jerman-Blazic, B., A. Gogala and D. Gabrijelcic, "Transparent language
   processing: A solution for internationalization of Internet
   services", The LISA Forum Newsletter, 5 (1996) p. 12-21.

Lee, F., "HZ - A Data Format for Exchanging Files of Arbitrarily Mixed
   Chinese and ASCII Characters." RFC 1843. Stanford University,
   August 1995.

McCarthy, J. "Arbitrary Character Sets." RFC 373. Stanford
   University, July 1972.

Moore, K. "MIME (Multipurpose Internet Mail Extensions) Part Two:
   Message Header Extensions for Non-ASCII Text." RFC 1522.
   University of Tennessee, September 1993.

Murai, J., M. Crispin and E. van der Poel. "Japanese Character
   Encoding for Internet Messages." RFC 1468. Keio University & Panda
   Programming, June 1993.

Nussbacher, H. "Handling of Bi-directional Texts in MIME." Israeli
   Inter-University, December 1993.

Nussbacher, H. and Y. Bourvine. "Hebrew Character Encoding for
   Internet Messages." RFC 1555. Israeli Inter-University and Hebrew
   University, December 1993.

Ohta, M. "Character Sets ISO-10646 and ISO-10646-J-1." RFC 1815.
   Tokyo Institute of Technology, July 1995.

Postel, J. and J. Reynolds. "File Transfer Protocol (FTP)." RFC 959.
   ISI, October 1985.

Postel, J. and J. Reynolds. "Telnet Protocol Specification." RFC 854.
   ISI, May 1983.

Reynolds, J., and J. Postel, "Assigned Numbers", RFC 1700,
   ISI, October 1994. p.100-117.

Rose, M. "The Internet Message", Prentice Hall, 1992.

Simonsen, K. "Character Mnemonics & Character Sets." RFC 1345.
   Rationel Almen Planlaegning, June 1992.

Unicode Consortium. "The Unicode Standard, Version 2.0". Reading,
   Mass.: Addison-Wesley Developers Press, 1996.

Wei, Y., et al. "ASCII Printable Characters-Based Chinese Character
   Encoding for Internet Messages." RFC 1842. AsiaInfo Services,
   Inc., et al. August 1995.

Zhu, H., et al. "Chinese Character Encoding for Internet Messages."
   RFC 1922. Tsinghua University, et al., March 1996.

Appendix F: Workshop attendee list

These people were participants on the workshop mailing list.
An * indicates that the person attended the workshop in person.

  Glenn Adams
* Joan Aliprand
* Harald Alvestrand
* Ran Atkinson
* Bert Bos
* Brian Carpenter
* Mark Crispin
  Makx Dekkers
  Robert Elz
  Patrik Faltstrom
* Zhu Haifeng
  Kenichi Handa
  Olle Jarnefors
  Borka Jerman-Blazic
  John Klensin
* Larry Masinter
* Rick McGowan
* Keith Moore
* Lisa Moore
  Ruth Moulton
* Cecilia Preston
* Joyce Reynolds
* Keld Simonsen
* Gary Smith
* Peter Svanberg
* Chris Weider

Appendix G: Authors' addresses

Chris Weider
cweider@microsoft.com
Microsoft Corp.
1 Microsoft Way
Redmond, WA 98052
USA

Cecilia Preston
cecilia@well.com
Preston & Lynch
PO Box 8310
Emeryville, CA 94662
USA

Keld Simonsen
Keld@dkuug.dk
DKUUG
Fruebjergvej 3
DK-2100 København Ø
Danmark

Harald T. Alvestrand
Harald.T.Alvestrand@uninett.no
UNINETT
P.O.Box 6883 Elgeseter
N-7002 TRONDHEIM
NORWAY

Randall Atkinson
rja@cisco.com
cisco Systems
170 West Tasman Drive
San Jose, CA 95134-1706
USA

Mark Crispin
mrc@cac.washington.edu
Networks & Distributed Computing
University of Washington
4545 15th Avenue NE
Seattle, WA 98105-4527