draft Charset policy June 97

IETF Policy on Character Sets and Languages

Fri Aug 29 10:41:03 MET DST 1997

Harald Tveit Alvestrand
UNINETT
Harald.T.Alvestrand@uninett.no

Status of this Memo

This draft document is being circulated for comment.

Please send comments to the author, or to the mailing list

The following text is required by the Internet-draft rules:

This document is an Internet Draft. Internet Drafts are working
documents of the Internet Engineering Task Force (IETF), its
Areas, and its Working Groups. Note that other groups may also
distribute working documents as Internet Drafts.

Internet Drafts are draft documents valid for a maximum of six
months. Internet Drafts may be updated, replaced, or obsoleted by
other documents at any time. It is not appropriate to use
Internet Drafts as reference material or to cite them other than
as a "working draft" or "work in progress."

Please check the I-D abstract listing contained in each Internet
Draft directory to learn the current status of this or any other
Internet Draft.

The file name of this version is
draft-alvestrand-charset-policy-01.txt

1. Introduction

The Internet is international.

With the international Internet follows an absolute requirement to
interchange data in a multiplicity of languages, which in turn
utilize a bewildering number of characters.

This document is intended to state the current policies being
applied by the Internet Engineering Steering Group towards the
standardization efforts in the Internet Engineering Task Force in
order to help Internet protocols fulfil these requirements.

The document is very much based upon the recommendations of the
IAB Character Set Workshop of February 29-March 1, 1996, which is
documented in RFC 2130 [WR].
This document attempts to be concise,
explicit and clear; people wanting more background are encouraged
to read RFC 2130.

The document uses the terms "MUST", "SHOULD" and "MAY", and their
negatives, in the way described in [RFC 2119]. In this case, "the
specification" as used by RFC 2119 refers to the processing of
protocols being submitted to the IETF standards process.

2. Where to do internationalization

Internationalization is for humans. This means that protocols are
not subject to internationalization; text strings are. Where
protocol elements look like text tokens, such as in many IETF
application layer protocols, protocols MUST specify which parts
are protocol and which are text. [WR 2.2.1.1]

Names are a problem, because people feel strongly about them, many
of them are mostly for local usage, and all of them tend to leak
out of the local context at times. RFC 1958 [ARCH] recommends
US-ASCII for all globally visible names.

This document does not mandate a policy on name
internationalization, but requires that all protocols describe
whether names are internationalized or US-ASCII.

NOTE: In the protocol stack for any given application, there is
usually one or a few layers that need to address these problems.
It would, for instance, not be appropriate to define language tags
for Ethernet frames. But it is the responsibility of the WGs to
ensure that whenever responsibility for internationalization is
left to "another layer", those responsible for that layer are in
fact aware that they HAVE that responsibility.

3. Definition of Terms

This document uses the term "charset" to mean a set of rules for
mapping from a sequence of octets to a sequence of characters,
such as the combination of a coded character set and a character
encoding scheme; this is also what is used as an identifier in
MIME "charset=" parameters, and registered in the IANA charset
registry [REG]. (Note that this is NOT a term used by other
standards bodies, such as ISO).

For a definition of the term "coded character set", refer to the
workshop report.

A "name" is an identifier such as a person's name, a hostname, a
domainname, a filename or an E-mail address; it is often treated
as an identifier rather than as a piece of text, and is often used
in protocols as an identifier for entities, without surrounding
text.

3.1. What charset to use

All protocols MUST identify, for all character data, which charset
is in use.

Protocols MUST be able to use the UTF-8 charset, which consists of
the ISO 10646 coded character set combined with the UTF-8
character encoding scheme, as defined in [10646] Annex R
(published in Amendment 2), for all text.

Protocols MAY specify, in addition, how to use other charsets or
other character encoding schemes for ISO 10646, such as UTF-16,
but lack of an ability to use UTF-8 is a violation of this policy;
such a violation would need a variance procedure ([BCP9] section
9) with clear and solid justification in the protocol
specification document before being entered into or advanced upon
the standards track.

For existing protocols or protocols that move data from existing
datastores, support of other charsets, or even using a default
other than UTF-8, may be a requirement. This is acceptable, but
UTF-8 support MUST be possible.
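As an illustration of the two requirements above (all character data
carries a charset label, and UTF-8 can always be used), the following
Python sketch shows one possible shape for a sender and a receiver.
The function names encode_for_transfer and decode_labelled_text are
hypothetical and not taken from any protocol, and Python's codec
registry merely stands in for the IANA charset registry; both are
assumptions of this sketch, not part of the policy.

```python
import codecs


def encode_for_transfer(text, charset="utf-8"):
    """Return (charset label, octets) for a piece of text.

    UTF-8 is the default, so it is always possible to use it; a
    legacy charset is used only when the caller asks for one.
    """
    return charset, text.encode(charset)


def decode_labelled_text(octets, charset):
    """Decode octets using the charset label that accompanied them."""
    if not charset:
        # The policy requires a label; this sketch refuses to guess.
        raise ValueError("character data must carry a charset label")
    try:
        codec = codecs.lookup(charset)
    except LookupError as exc:
        raise ValueError("unsupported charset: %r" % charset) from exc
    return octets.decode(codec.name)
```

A receiver written this way never guesses at an encoding: unlabelled
data is rejected rather than interpreted under a local default.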
When charsets other than UTF-8 are used, they MUST be registered in
the IANA charset registry, if necessary by registering them when
the protocol is published.

(Note: ISO 10646 calls the UTF-8 CES a "Transformation Format"
rather than a "character encoding scheme", but it fits the charset
workshop report definition of a character encoding scheme).

3.2. How to decide a charset

In some cases, like HTTP, there is direct or semi-direct
communication between the producer and the consumer of data
containing text. In such cases, it may make sense to negotiate a
charset before sending data.

In other cases, like E-mail or stored data, there is no such
communication, and the best one can do is to make sure the charset
is clearly identified with the stored data, and to choose a charset
that is as widely known as possible.

Note that a charset is an absolute; text that is encoded in a
charset cannot be rendered comprehensibly without supporting that
charset.

(This also applies to English texts; charsets like EBCDIC do NOT
have ASCII as a proper subset.)

Negotiating a charset may be regarded as an interim mechanism that
is to be supported until support for interchange of UTF-8 is
prevalent; however, the timeframe of "interim" may be at least 50
years, so there is every reason to think of it as permanent in
practice.

4. Languages

4.1. The need for language information

All human-readable text has a language.

Many operations, including high quality formatting, text-to-speech
synthesis, searching, hyphenation, spellchecking and so on benefit
greatly from access to information about the language of a piece
of text. [WR 3.1.1.4].
Humans have some tolerance for foreign languages, but are
generally very unhappy with being presented text in a language
they do not understand; this is why negotiation of language is
needed.

In most cases, machines will not be able to deduce the language of
a transmitted text by themselves; the protocol must specify how to
transfer the language information if it is to be available at all.

The interaction between language and processing is complex; for
instance, if I compare "name-of-thing(lang=en)" to
"name-of-thing(lang=no)" for equality, I will generally expect a match,
while the word "ask(no)" is a kind of tree, and is hardly useful
as a command verb.

4.2. Requirement for language tagging

Protocols that transfer text MUST provide for carrying information
about the language of that text.

Protocols SHOULD also provide for carrying information about the
language of names, where appropriate.

Note that this does NOT mean that such information must always be
present; the requirement is that if the sender of information
wishes to send information about the language of a text, the
protocol provides a well-defined way to carry this information.

4.3. How to identify a language

The RFC 1766 language tag is at the moment the most flexible tool
available for identifying a language; protocols SHOULD use this,
or provide clear and solid justification for doing otherwise in
the document.

Note also that a language is distinct from a POSIX locale; a POSIX
locale identifies a set of cultural conventions, which may imply a
language (the POSIX or "C" locale of course does not), while a
language tag as described in RFC 1766 identifies only a language.

4.4. Considerations for language negotiation

Protocols where users have text presented to them in response to
user actions MUST provide for support of multiple languages.
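One way a server might honour this requirement is to select among
the languages it has available, based on an ordered list of RFC 1766
tags proposed by the client. The Python sketch below is only an
illustration: the function name choose_language is hypothetical, the
fallback from a tag such as "en-GB" to its primary tag "en" is an
assumption of the sketch rather than a rule stated in this document,
and falling back to English when nothing matches follows the Default
Language rule of section 4.5.

```python
DEFAULT_LANGUAGE = "en"  # the Default Language named in section 4.5


def choose_language(client_preferences, server_available):
    """Return the first client-proposed language the server can supply.

    Tags are compared case-insensitively. If a full tag such as
    "en-GB" is not available, its primary tag ("en") is tried
    (an assumption of this sketch). If nothing matches, the
    Default Language is returned.
    """
    available = {tag.lower(): tag for tag in server_available}
    for tag in client_preferences:
        wanted = tag.lower()
        if wanted in available:
            return available[wanted]
        primary = wanted.split("-", 1)[0]
        if primary in available:
            return available[primary]
    return DEFAULT_LANGUAGE
```

For example, a client proposing ["no", "en-GB"] to a server that
offers ["en", "fr"] would be answered in "en".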
How this is done will vary between protocols; for instance, in
some cases, a negotiation where the client proposes a set of
languages and the server replies with one is appropriate; in other
cases, a server may choose to send multiple variants of a text and
let the client pick which one to display.

Negotiation is useful in the case where one side of the protocol
exchange is able to present text in multiple languages to the
other side, and the other side has a preference for one of these;
the most common example is the text part of error responses, or
Web pages that are available in multiple languages.

Negotiating a language should be regarded as a permanent
requirement of the protocol that will not go away at any time in
the future.

In many cases, it should be possible to include it as part of the
connection establishment, together with authentication and other
preferences negotiation.

4.5. Default Language

When human-readable text must be presented in a context where the
sender has no knowledge of the recipient's language preferences
(such as login failures or E-mailed warnings, or prior to language
negotiation), text SHOULD be presented in the Default Language.

The Default Language is English, since this is the language which
the greatest number of people will be able to get adequate help in
interpreting when working with computers.

Note that negotiating English is NOT the same as Default Language;
Default Language is an emergency measure in otherwise unmanageable
situations. It may be appropriate for application designers to
make sure that messages in Default Language are understandable to
people with a limited understanding of the English language.

5. Locale

The POSIX standard [POSIX] defines a concept called a "locale",
which includes a lot of information about collating order for
sorting, date format, currency format and so on.

In some cases, and especially with text where the user is expected
to do processing on the text, locale information may be usefully
attached to the text; this would identify the sender's opinion
about appropriate rules to follow when processing the document,
which the recipient may choose to agree with or ignore.

This document does not require the communication of locale
information on all text, but encourages its inclusion when
appropriate.

Note that language and character set information will often be
present as parts of a locale tag (such as no_NO.iso-8859-1; the
language is before the underscore and the character set is after
the dot); care must be taken to define precisely which
specification of character set and language applies to any one
text item.

The default locale is the "POSIX" locale.

6. Documenting internationalization decisions

In documents that deal with internationalization issues at all, a
synopsis of the approaches chosen for internationalization SHOULD
be collected into a section called "Internationalization
considerations", and placed next to the Security Considerations
section.

This provides an easy reference for those who are looking for
advice on these issues when implementing the protocol.

7. Security considerations

Apart from the fact that security warnings in a foreign language
may cause inappropriate behaviour from the user, and the fact that
multilingual systems usually have problems with consistency
between language variants, no relevant security considerations
have been identified.

8. References

[10646]
     ISO/IEC, Information Technology - Universal Multiple-Octet
     Coded Character Set (UCS) - Part 1: Architecture and Basic
     Multilingual Plane, May 1993, with amendments

[RFC 2119]
     S. Bradner, "Key words for use in RFCs to Indicate
     Requirement Levels", 03/26/1997 - RFC 2119

[WR]
     C. Weider, C. Preston, K. Simonsen, H. Alvestrand, R.
     Atkinson, M. Crispin, P. Svanberg, "The Report of the IAB
     Character Set Workshop held 29 February - 1 March, 1996",
     04/21/1997, RFC 2130

[ARCH]
     B. Carpenter, "Architectural Principles of the Internet",
     06/06/1996, RFC 1958

[POSIX]
     ISO/IEC 9945-2:1993 Information technology -- Portable
     Operating System Interface (POSIX) -- Part 2: Shell and
     Utilities

[REG]
     N. Freed, J. Postel: IANA Charset Registration Procedures,
     Work In Progress (draft-freed-charset-reg-02.txt)

[UTF-8]
     F. Yergeau: UTF-8, a transformation format of Unicode and
     ISO 10646, Work In Progress (draft-yergeau-utf8-rev-00.txt,
     obsoletes RFC 2044)

[BCP9]
     S. Bradner: The Internet Standards Process -- Revision 3. RFC
     2026, BCP 9.

9. Author's address

Harald Tveit Alvestrand
UNINETT
P.O.Box 6883 Elgeseter
N-7002 TRONDHEIM
NORWAY

+47 73 59 70 94
Harald.T.Alvestrand@uninett.no