Network Working Group                                         F. Yergeau
Internet Draft                                                  G. Nicol
<draft-ietf-html-i18n-03.txt>                                   G. Adams
Expires 18 August 1996                                         M. Duerst
                                                        13 February 1996


         Internationalization of the Hypertext Markup Language


Status of this Memo

   This document is an Internet-Draft. Internet-Drafts are working
   documents of the Internet Engineering Task Force (IETF), its areas,
   and its working groups. Note that other groups may also distribute
   working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months. Internet-Drafts may be updated, replaced, or obsoleted by
   other documents at any time. It is not appropriate to use
   Internet-Drafts as reference material or to cite them other than as
   a "working draft" or "work in progress".

   To learn the current status of any Internet-Draft, please check the
   1id-abstracts.txt listing contained in the Internet-Drafts Shadow
   Directories on ds.internic.net (US East Coast), nic.nordu.net
   (Europe), ftp.isi.edu (US West Coast), or munnari.oz.au (Pacific
   Rim).

   Distribution of this document is unlimited. Please send comments to
   the HTML working group (HTML-WG) of the Internet Engineering Task
   Force (IETF) at . Subscription address is . Discussions of the group
   are archived at URL: http://www.acl.lanl.gov/HTML_WG/archives.html.

Abstract

   The Hypertext Markup Language (HTML) is a simple markup language used
   to create hypertext documents that are platform independent.
   Initially, the application of HTML on the World Wide Web was
   seriously restricted by its reliance on the ISO-8859-1 coded
   character set, which is appropriate only for Western European
   languages. Despite this restriction, HTML has been widely used with
   other languages, using other coded character sets or character
   encodings, at the expense of interoperability.

   This document addresses the internationalization of HTML by
   extending the specification of HTML and giving additional
   recommendations for proper internationalization support. A foremost
   consideration is to make sure that HTML remains a valid application
   of SGML, while enabling its use in all languages of the world.

Table of contents

   1. Introduction
      1.1. Scope
      1.2. Conformance
   2. The document character set
      2.1. Reference processing model
      2.2. The document character set
      2.3. Undisplayable characters
   3. The LANG attribute
   4. Additional entities, attributes and elements
      4.1. Full Latin-1 entity set
      4.2. Markup for language-dependent presentation
   5. Forms
      5.1. DTD additions
      5.2. Form submission
   6. Miscellaneous
   7. HTML public text
      7.1. HTML DTD
      7.2. SGML declaration for HTML
      7.3. ISO Latin 1 character entity set
   Bibliography
   Authors' Addresses

1. Introduction

   The Hypertext Markup Language (HTML) is a simple markup language used
   to create hypertext documents that are platform independent.
   Initially, the application of HTML on the World Wide Web was
   seriously restricted by its reliance on the ISO-8859-1 coded
   character set, which is appropriate only for Western European
   languages. Despite this restriction, HTML has been widely used with
   other languages, using other coded character sets or character
   encodings, through various ad hoc extensions to the language
   [TAKADA].

   This document addresses the internationalization of HTML by
   extending the specification of HTML and giving additional
   recommendations for proper internationalization support. It is in
   good part based on a paper by one of the authors on multilingualism
   on the WWW [NICOL].
   A foremost consideration is to make sure that HTML remains a valid
   application of SGML, while enabling its use in all languages of the
   world.

   The specific issues addressed are the SGML document character set to
   be used for HTML, the proper treatment of the charset parameter
   associated with the "text/html" content type, and the specification
   of language tags and additional entities.

1.1. Scope

   HTML has been in use by the World-Wide Web (WWW) global information
   initiative since 1990. This specification extends the capabilities
   of HTML (RFC 1866), primarily by removing the restriction to the
   ISO-8859-1 coded character set [ISO-8859-1].

   HTML is an application of ISO Standard 8879:1986, Information
   Processing Text and Office Systems -- Standard Generalized Markup
   Language (SGML) [ISO-8879]. The HTML Document Type Definition (DTD)
   is a formal definition of the HTML syntax in terms of SGML. This
   specification amends the DTD of HTML in order to make it applicable
   to documents encompassing a character repertoire much larger than
   that of ISO-8859-1, while still remaining SGML conformant.

1.2. Conformance

   This specification slightly changes the conformance requirements of
   HTML documents and HTML user agents.

1.2.1. Documents

   All HTML 2.0 conforming documents remain conforming with this
   specification. However, the extensions introduced here make valid
   certain documents that would not be HTML 2.0 conforming, in
   particular those containing characters or character references
   outside of the repertoire of ISO 8859-1, and those containing markup
   introduced herein.

1.2.2. User agents

   In addition to the requirements of RFC 1866, the following
   requirements are placed on HTML user agents.

      To ensure interoperability and proper support for at least
      ISO-8859-1 in an environment where character encoding schemes
      other than ISO-8859-1 are present, user agents must correctly
      interpret the charset parameter accompanying an HTML document
      received from the network.

      Furthermore, conforming user agents are required to at least
      parse correctly all numeric character references within the range
      of the Basic Multilingual Plane (BMP) of ISO 10646-1 [ISO-10646].

2. The document character set

2.1. Reference processing model

   This overview explains a reference processing model used for HTML,
   and in particular the SGML concept of a document character set. An
   actual implementation may differ widely in its internal workings
   from the model given below, but should behave as described to an
   outside observer.

   Because there are various widely differing encodings of text, SGML
   does not directly address the question of how characters are
   encoded, e.g., in a file. SGML views the characters as a single set
   (called a "character repertoire"), and a "code set" that assigns an
   integer number (known as "character number") to each character in
   the repertoire. The document character set declaration defines what
   each of the character numbers represents [GOLD90, p. 451]. In most
   cases, an SGML DTD and all documents that refer to it have a single
   document character set, and all markup and data characters are part
   of this set.

   HTML, as an application of SGML, does not directly address the
   question of how characters are encoded as octets in external
   representations such as files. This is deferred to mechanisms
   external to HTML, such as MIME as used by the HTTP protocol or by
   electronic mail.

   For the HTTP protocol [HTTP-1.0], the way characters are encoded is
   defined by the "charset" parameter [1] of the "Content-Type" field
   of the header of an HTTP response.
   For example, to indicate that the transmitted document is encoded in
   the "JIS" encoding of Japanese [RFC1468], the header will contain
   the following line:

      Content-Type: text/html; charset=ISO-2022-JP

   _________________________
   [1] The term "charset" in MIME is used to designate a character
   encoding, rather than a coded character set as the term may suggest.
   A character encoding is a mapping (possibly many-to-one) of a
   sequence of octets to a sequence of characters taken from one or
   more character repertoires.

   The default charset parameter in the case of the HTTP protocol is
   ISO-8859-1 (the so-called "Latin-1" for Western European
   characters). The HTTP protocol also defines a mechanism for the
   client to specify the character encodings it can accept. Clients and
   servers are strongly requested to use these mechanisms to ensure
   correct transmission and interpretation of any document. Provisions
   that can be taken to help correct interpretation, even in cases
   where a server or client does not yet use these mechanisms, are
   described in section 6.

   Similarly, if HTML documents are transferred by electronic mail, the
   character encoding is defined by the "charset" parameter of the
   "Content-Type" MIME header line [RFC1521].

   If any other way of transferring and storing HTML documents is
   defined or becomes popular, it is advised that similar provisions be
   made to clearly identify the character encoding used and/or to use a
   single/default encoding capable of representing the widest range of
   characters used in an international context.

   Whatever the external character encoding may be, the reference
   processing model translates it to a representation of the document
   character set specified in Section 2.2 before processing specific to
   SGML/HTML.

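   As an illustration only, the decoding step described above can be
   sketched in Python. The function names are hypothetical and not part
   of this specification. The sketch assumes that a Python string, which
   is a sequence of Unicode code points, is an acceptable stand-in for
   text in the document character set, and it falls back to ISO-8859-1
   when no charset parameter is present, as the text above states for
   HTTP.

```python
from email.message import Message

# HTTP default named in the text above; used when no charset is given.
DEFAULT_CHARSET = "iso-8859-1"


def charset_of(content_type):
    """Return the lowercased charset parameter of a Content-Type value."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns None when the parameter is absent.
    return msg.get_content_charset() or DEFAULT_CHARSET


def decode_document(octets, content_type):
    """Hypothetical decoder stage: external octets -> Unicode text."""
    return octets.decode(charset_of(content_type))
```

   With these definitions, decode_document(body, "text/html;
   charset=ISO-2022-JP") would hand the later stages the same characters
   regardless of how the octets were laid out on the wire.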
   The reference processing model can be depicted as follows:

   [resource]->[decoder]->[entity ]->[ SGML ]->[application]->[display]
                          [manager]  [parser]
                              ^          |
                              |          |
                              +----------+

   The decoder is responsible for decoding the external representation
   of the resource to a representation using the document character
   set. The entity manager, the parser, and the application deal only
   with characters of the document character set. A display-oriented
   part of the application or the display machinery itself may again
   convert characters represented in the document character set to some
   other representation more suitable for their purpose. In any case,
   the entity manager, the parser, and the application, as far as
   character semantics are concerned, are using the HTML document
   character set only.

   An actual implementation may or may not choose to translate the
   document into some encoding of the document character set as
   described above; the behaviour described by this reference
   processing model can be achieved otherwise. This subject is well out
   of the scope of this specification, however, and the reader is
   invited to consult the SGML standard [ISO-8879] or an SGML handbook
   [BRYAN88] [GOLD90] [VANH90] [SQ91] for further information.

   The most important consequence of this reference processing model is
   that numeric character references are always resolved with respect
   to the fixed document character set, and thus to the same
   characters, whatever the external encoding actually used. For an
   example, see Section 2.2.

2.2. The document character set

   The document character set, in the SGML sense, is the Basic
   Multilingual Plane of ISO 10646:1993 [ISO-10646], also known as
   UCS-2. This is code-by-code identical with the Unicode standard
   [UNICODE].

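   Because the document character set coincides with Unicode, resolving
   a numeric character reference amounts to mapping the number directly
   to a code point, whatever encoding carried the document. The
   following Python sketch illustrates this under stated assumptions:
   the name resolve_ncrs is hypothetical, only decimal references of
   the form &#N; are handled, the upper limit of 65533 is the one this
   specification gives for UCS-2, and positions marked UNUSED in the
   SGML declaration are not checked.

```python
import re

# Decimal numeric character references such as "&#1048;".
NCR = re.compile(r"&#([0-9]+);")

# Upper limit of the reference range stated in this specification.
MAX_NCR = 65533


def resolve_ncrs(text):
    """Replace in-range numeric character references by their characters.

    `text` is assumed to be already decoded, i.e. expressed in the
    document character set; out-of-range references are left untouched.
    """
    def replace(match):
        number = int(match.group(1))
        if number > MAX_NCR:
            return match.group(0)
        return chr(number)

    return NCR.sub(replace, text)
```

   The same reference yields the same character whether the surrounding
   document arrived as ISO-8859-1, ISO-2022-JP or any other encoding,
   since the decoder has already run by the time references are
   resolved.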
   The adoption of this document character set implies a change in the
   SGML declaration specified in the HTML 2.0 specification (section
   9.5 of [RFC1866]). The change amounts to removing the two BASESET
   specifications and their accompanying DESCSET declarations,
   replacing them with the following declaration:

     BASESET "ISO Registration Number 176//CHARSET
             ISO/IEC 10646-1:1993 UCS-2 with implementation level 3
             //ESC 2/5 2/15 4/5"
     DESCSET   0     9     UNUSED
               9     2     9
               11    2     UNUSED
               13    1     13
               14    18    UNUSED
               32    95    32
               127   1     UNUSED
               128   32    UNUSED
               160   65374 160

   Making UCS-2 the document character set does not make non-conforming
   any expression, construct or document that conforms to HTML 2.0. It
   does make conforming certain constructs that are not admissible in
   HTML 2.0. One consequence is that data characters outside the
   repertoire of ISO-8859-1, but within that of UCS-2, become valid
   SGML characters. Another is that the upper limit of the range of
   numeric character references is extended from 255 to 65533 [2];
   thus, &#1048; is a valid reference to a "CYRILLIC CAPITAL LETTER I".
   [ERCS] is a good source of information on Unicode and SGML, although
   its scope and technical content differ greatly from this
   specification.

   _________________________
   [2] 65533 (FFFD hexadecimal) is the last valid character in UCS-2.
   65534 (FFFE hexadecimal) is unassigned and reserved as the
   byte-swapped version of ZERO WIDTH NON-BREAKING SPACE for byte-sex
   detection purposes. 65535 (FFFF hexadecimal) is unassigned.

   ISO 10646-1:1993 is the most encompassing character set currently
   existing, and there is no other character set that could take its
   place as the document character set for HTML. Also, it is expected
   that with future extensions of ISO 10646, this specification may
   also be extended.
   If nevertheless for a specific application there is a need to use
   characters outside this standard, this should be done by avoiding
   any conflicts with present or future versions of ISO 10646, i.e. by
   assigning these characters to a private zone. Also, it should be
   borne in mind that such a use will be highly unportable; in many
   cases, it may be better to use inline bitmaps.

2.3. Undisplayable characters

   With the document character set being the full ISO 10646 BMP, the
   possibility that a character cannot be displayed due to lack of
   appropriate resources (fonts) cannot be avoided. Because there are
   many different things that can be done in such a case, this document
   does not prescribe any specific behaviour. Depending on the
   implementation, this may also be handled by the underlying display
   system and not the application itself. The following considerations,
   however, may be of help:

   -  A clearly visible, but unobtrusive behaviour should be preferred.
      Some documents may contain many characters that cannot be
      rendered, and so showing an alert for each of them is not the
      right thing to do.

   -  In case a numeric representation of the missing character is
      given, its hexadecimal (not decimal) form is to be preferred,
      because this form is used in character set standards [ERCS].

3. The LANG attribute

   Language tags can be used to control rendering of a marked up
   document in various ways: character disambiguation, in cases where
   the character encoding is not sufficient to resolve to a specific
   glyph; quotation marks; hyphenation; ligatures; spacing; voice
   synthesis; etc. Independently of rendering issues, language markup
   is useful as content markup for purposes such as classification and
   searching.

   Since any text can logically be assigned a language, almost all HTML
   elements admit the LANG attribute. The DTD reflects this.
   It is also intended that any new element introduced in later
   versions of HTML will admit the LANG attribute, unless there is a
   good reason not to do so.

   The language attribute, LANG, takes as its value a language tag that
   identifies a natural language spoken, written, or otherwise conveyed
   by human beings for communication of information to other human
   beings. Computer languages are explicitly excluded.

   The syntax and registry of HTML language tags are the same as those
   defined by RFC 1766 [RFC1766]. In summary, a language tag is
   composed of one or more parts: a primary language tag and a possibly
   empty series of subtags:

        language-tag  = primary-tag *( "-" subtag )
        primary-tag   = 1*8ALPHA
        subtag        = 1*8ALPHA

   Whitespace is not allowed within the tag and all tags are
   case-insensitive. The namespace of language tags is administered by
   the IANA. Example tags include:

       en, en-US, en-cockney, i-cherokee, x-pig-latin

   Two-letter primary-tags are reserved for ISO 639 language
   abbreviations [ISO-639], and three-letter primary-tags for the
   language abbreviations of the "Ethnologue" [ETHNO] (the latter is in
   addition to the requirements of RFC 1766). Any two-letter initial
   subtag is an ISO 3166 country code [ISO-3166].

   In the context of HTML, a language tag is not to be interpreted as a
   single token, as per RFC 1766, but as a hierarchy. For example, a
   user agent that adjusts rendering according to language should
   consider that it has a match when a language tag in a style sheet
   entry matches the initial portion of the language tag of an element.
   An exact match should be preferred. This interpretation allows an
   element marked up as, for instance, "en-US" to trigger styles
   corresponding to, in order of preference, US-English ("en-US") or
   'plain' or 'international' English ("en").

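   The hierarchical matching described above could be sketched in
   Python as follows. This is an illustration, not a prescribed
   algorithm: the function names are hypothetical, and the sketch
   assumes that "initial portion" means a prefix made of whole subtags,
   so that "en" matches "en-US" but would not match a longer primary
   tag that merely begins with the same letters.

```python
def match_depth(element_tag, style_tag):
    """Return how many subtags of style_tag prefix element_tag, or None.

    Comparison is case-insensitive, as language tags are.
    """
    element_parts = element_tag.lower().split("-")
    style_parts = style_tag.lower().split("-")
    if element_parts[:len(style_parts)] != style_parts:
        return None
    return len(style_parts)


def best_style_tag(element_tag, style_tags):
    """Pick the style sheet tag matching the element most specifically.

    An exact match has the greatest depth and is therefore preferred.
    Returns None when no style sheet tag matches.
    """
    matches = []
    for tag in style_tags:
        depth = match_depth(element_tag, tag)
        if depth is not None:
            matches.append((depth, tag))
    if not matches:
        return None
    return max(matches, key=lambda pair: pair[0])[1]
```

   For an element marked "en-US", a style sheet offering "en" and
   "en-US" would select "en-US"; one offering only "en" and "fr" would
   fall back to "en".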
      NOTE -- using the language tag as a hierarchy does not imply
      that all languages with a common prefix will be understood by
      those fluent in one or more of those languages; it simply allows
      the user to request this commonality when it is true for that
      user.

   The rendering of elements may be affected by the LANG attribute. For
   any element, the value of the LANG attribute overrides the value
   specified by the LANG attribute of any enclosing element and the
   value (if any) of the HTTP Content-Language header. If none of these
   are set, a suitable default, perhaps controlled by user preferences,
   by automatic context analysis or by the user's locale, should be
   used to control rendering.

4. Additional entities, attributes and elements

4.1. Full Latin-1 entity set

   Following the suggestion in section 14 of [RFC1866], the set of
   Latin-1 entities is extended to cover the whole right part of
   ISO-8859-1 (all code positions with the high-order bit set). The
   names of the entities are taken from the appendices of [SGML]. A
   list is provided in section 7.3 of this specification.

4.2. Markup for language-dependent presentation

   For the correct presentation of text from certain languages
   (irrespective of formatting issues), some support in the form of
   additional entities and elements is needed. Markup is needed in some
   cases to force or block joining behavior in contexts in which
   joining would occur but should not, or would not occur but should.

   Many languages are written in horizontal lines from left to right,
   while others are written from right to left. When both writing
   directions are present, one talks of bidirectional text (BIDI for
   short). BIDI text requires markup in special circumstances where
   ambiguities as to the directionality of some characters have to be
   resolved.
   This markup affects the ability to render BIDI text in a
   semantically legible fashion. That is, without this special BIDI
   markup, cases arise which would prevent *any* rendering whatsoever
   that reflected the basic meaning of the text. Plain text may contain
   this markup (joining or BIDI) in the form of special-purpose
   characters; in HTML, these are replaced by SGML markup as follows:

   First, a generic container is needed to carry the LANG and DIR (see
   below) attributes in cases where no other element is appropriate;
   the SPAN element is introduced for that purpose.

   A set of named character entities is added that allows partial
   support of the Unicode bidirectional algorithm [UNICODE], plus some
   help with languages requiring contextual analysis for rendering:

      <!ENTITY zwnj CDATA "&#8204;"-- zero width non-joiner-->
      <!ENTITY zwj  CDATA "&#8205;"-- zero width joiner-->
      <!ENTITY lrm  CDATA "&#8206;"-- left-to-right mark-->
      <!ENTITY rlm  CDATA "&#8207;"-- right-to-left mark-->

   Next, an attribute called DIR is introduced, restricted to the
   values LTR (left-to-right) and RTL (right-to-left) and admitted by
   most elements. On block-type elements, the DIR attribute indicates
   the base directionality of the text in the block; if omitted it is
   inherited from the parent element. On inline elements, it makes the
   element start a new embedding level (to be explained below); if
   omitted the inline element does not start a new embedding level.

   Lastly, a new element called BDO (BIDI override) is introduced,
   which requires the DIR attribute to specify whether the override is
   left-to-right or right-to-left. Its effect is to force the
   directionality of all characters within it to the value of DIR,
   irrespective of their intrinsic directional properties.

   The zero-width joiner and non-joiner (&zwj; and &zwnj;) are used to
   control joining behaviour.
   For example, ARABIC LETTER HEH is used in isolation to abbreviate
   "Hijri" (the Islamic calendrical system); however, the initial form
   of the letter is desired, because the isolated form of HEH looks
   like the digit five as employed in Arabic script. This is obtained
   by following the HEH with a zero-width joiner whose only effect is
   to provide context. In Persian texts, there are cases where a letter
   that normally would join a subsequent letter in a cursive connection
   does not. Here a zero-width non-joiner is used.

   The left-to-right and right-to-left marks (&lrm; and &rlm;) are used
   to disambiguate directionality of neutral characters, e.g., if you
   have a double quote sitting between an Arabic and a Latin letter,
   then which direction does the quote resolve to? These characters are
   like zero width spaces which have a directional property (but no
   word/line break property).

   Nested embedding of contra-directional text runs is also a case
   where the implicit directionality of characters is not sufficient,
   requiring markup. A common need for the embedding controls is to
   handle text that has been pasted from one bidi context to another,
   and the possibility of multiply embedded pastings. Following is an
   example of a case where embedding is needed, showing its effect:

      Given the following Latin (upper case) and Arabic (lower case)
      letters in backing store with the specified embeddings:

      <SPAN DIR=LTR> AB <SPAN DIR=RTL> xy <SPAN DIR=LTR> CD </SPAN>
      zw </SPAN> EF </SPAN>

      One gets the following rendering (with [] showing the
      directional transitions):

      [ AB [ wz [ CD ] yx ] EF ]

      On the other hand, without this markup and with a base direction
      of LTR one gets the following rendering:

      [ AB [ yx ] CD [ wz ] EF ]

      Notice that yx is on the left and wz on the right unlike the
      above case where the embedding levels are used.
      Without the embedding markup one has at most two levels: a base
      directional level and a single counterflow directional level.

   The directional override feature (<BDO>) is needed to deal with
   unusual pieces of text in which directionality cannot be resolved
   from context in an unambiguous fashion. For example, it can be used
   to force left-to-right (or right-to-left) display of part numbers
   composed of Latin letters, digits and Hebrew letters.

   A few other additional elements are important to have for proper
   language-dependent rendering.

   Short quotations, and in particular the quotation marks surrounding
   them, are typically rendered differently in different languages and
   on platforms with different graphic capabilities: "a quotation in
   English", `another, slightly better one', ,,a quotation in German'',
   << a quotation in French >>. The <Q> element is introduced for that
   purpose.

   Many languages require superscripts for proper rendering: as an
   example, the French "Mlle Dupont" should have "lle" in superscript.
   The <SUP> element, and its sibling <SUB>, are introduced to allow
   proper markup of such text. <SUP> and <SUB> contents are restricted
   to PCDATA to avoid nesting problems.

   Finally, in many languages text justification is much more important
   than it is in Western languages, and so warrants markup. The ALIGN
   attribute, admitting values of LEFT, RIGHT, CENTER and JUSTIFY, is
   added to a selection of elements where it makes sense (block-like).

5. Forms

5.1. DTD additions

   It is natural to expect input in any language in forms, as they
   provide one of the few ways of obtaining user input. While this is
   primarily a UI issue, there are some things that should be specified
   at the HTML level to guide behavior and promote interoperability.

To ensure full interoperability, it is necessary for the user agent
(and the user) to have an indication of the character encoding(s) that
the server providing a form will be able to handle upon submission of
the filled-in form. Such an indication is provided by the
ACCEPT-CHARSET attribute of the INPUT and TEXTAREA elements, modeled on
the HTTP Accept-Charset header (see [HTTP-1.1]), which contains a space
and/or comma delimited list of character sets acceptable to the server.
A user agent may want to advise the user of the contents of this
attribute, or to restrict the user's ability to enter characters
outside the repertoires of the listed character sets.

     NOTE -- The list of character sets is to be interpreted as
     an EXCLUSIVE-OR list; the server announces that it is ready
     to accept any ONE of these character encoding schemes for
     each part of a multipart entity. The client may perform
     character encoding translation to satisfy the server if
     necessary.

     NOTE -- The default value for the ACCEPT-CHARSET attribute
     of an INPUT or TEXTAREA element is the reserved value
     "UNKNOWN". A user agent may interpret that value as the
     character encoding scheme that was used to transmit the
     document containing that element.

5.2. Form submission

The HTML 2.0 form submission mechanism, based on the
"application/x-www-form-urlencoded" media type, is ill-equipped with
regard to internationalization. In fact, since URLs are restricted to
ASCII characters, the mechanism is awkward even for ISO-8859-1 text.
Section 2.2 of [RFC1738] specifies that octets may be encoded using the
"%HH" notation, but text submitted from a form is composed of
characters, not octets. Lacking a specification of a character encoding
scheme, the "%HH" notation has no well-defined meaning.
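
As an illustration of why the "%HH" notation is ambiguous without a
declared character encoding scheme, the following Python sketch
percent-encodes the same character under two encodings. The sample
character and the choice of UTF-8 as the second encoding are
assumptions made for illustration only; they are not taken from this
specification.

```python
from urllib.parse import quote, unquote

text = "é"  # LATIN SMALL LETTER E WITH ACUTE (illustrative sample)

# The same character maps to different octets, hence different %HH forms.
as_latin1 = quote(text, encoding="iso-8859-1")
as_utf8 = quote(text, encoding="utf-8")
print(as_latin1)  # %E9
print(as_utf8)    # %C3%A9

# A receiver that assumes the wrong encoding recovers different characters.
misread = unquote(as_utf8, encoding="iso-8859-1")
print(misread)    # Ã©
```

The octets on the wire are well defined in both cases; what is
undefined, absent a label, is which characters they stand for.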
A partial solution to this sorry state of affairs is to specify a
default character encoding scheme to be assumed when the GET method of
form submission is used. Specifying UCS-2 would break all existing
forms, so the only sensible choice is to designate ISO-8859-1. That is,
the encoded URL sent to submit a form by the GET method is to be
interpreted as a sequence of single-octet characters encoded according
to ISO-8859-1, and further encoded according to the scheme of [RFC1738]
(the "%HH" notation). This is clearly insufficient, so designers of
forms are advised to use the POST method of form submission whenever
possible.

A better solution is to add a MIME charset parameter to the
"application/x-www-form-urlencoded" media type specifier sent along
with a POST method form submission, with the understanding that the URL
encoding of [RFC1738] is applied on top of the specified character
encoding, as a kind of implicit Content-Transfer-Encoding. The default
ISO-8859-1 is implied in the absence of a charset parameter.

The best solution is to use the "multipart/form-data" media type
described in [RFC1867] with the POST method of form submission. This
mechanism encapsulates the value part of each name-value pair in a
body-part of a multipart MIME body that is sent as the HTTP entity;
each body part can be labeled with an appropriate Content-Type,
including if necessary a charset parameter that specifies the character
encoding scheme. The changes to the DTD necessary to support this
method of form submission have been incorporated in the DTD included in
this specification.

How the user agent determines the encoding of the text entered by the
user is outside the scope of this specification.
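
The "multipart/form-data" approach described above can be sketched
roughly as follows in Python. This is a minimal illustration, not a
complete implementation of [RFC1867]: the function name
encode_multipart, the boundary string, the field names and the sample
values are all hypothetical, and the sketch assumes that field names
are ASCII and that the boundary does not occur in any value.

```python
def encode_multipart(fields, boundary):
    """Encode (name, value, charset) triples as a multipart/form-data body.

    Hypothetical helper for illustration: each value is placed in its own
    body part and labeled with the charset used to encode it.
    """
    delimiter = b"--" + boundary.encode("ascii")
    lines = []
    for name, value, charset in fields:
        lines.append(delimiter)
        lines.append(
            b'Content-Disposition: form-data; name="'
            + name.encode("ascii")
            + b'"'
        )
        lines.append(
            b"Content-Type: text/plain; charset=" + charset.encode("ascii")
        )
        lines.append(b"")  # blank line separates part headers from content
        lines.append(value.encode(charset))
    lines.append(delimiter + b"--")  # closing delimiter
    lines.append(b"")
    return b"\r\n".join(lines)


boundary = "example-boundary"  # assumed value, chosen for readability
content_type = "multipart/form-data; boundary=" + boundary
body = encode_multipart(
    [
        ("city", "Montréal", "iso-8859-1"),
        ("note", "Ελληνικά", "utf-8"),
    ],
    boundary,
)
```

Because every part carries its own charset label, the receiver does
not need a document-wide default to interpret the submitted octets.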
     NOTE -- Designers of forms and their handling scripts
     should be aware of an important caveat: when the default
     value of a field (the VALUE attribute) is returned upon
     form submission (i.e., the user did not modify this value),
     it cannot be guaranteed to be transmitted as a sequence of
     octets identical to that in the source document -- only as
     a possibly different but valid encoding of the same
     sequence of characters.

     This may be true even if the encoding of the document
     containing the form and that used for submission are the
     same, because only the sequence of characters of the
     default value, not the actual sequence of octets, may be
     counted on to be preserved.

6. Miscellaneous

Proper interpretation of a text document requires that the character
encoding scheme be known. Current HTTP servers, however, do not
generally include an appropriate charset parameter with the
Content-Type header, even when the encoding scheme is different from
the default ISO-8859-1. This is bad behaviour [3], and as such strongly
discouraged, but some preventive measures can be taken to minimize the
detrimental effects.

_________________________
[3] This bad behaviour is even encouraged by the continued existence
of browsers that declare an unrecognized media type when they receive
a charset parameter. User agent implementors are strongly encouraged
to make their software tolerant of this parameter, even if they cannot
take advantage of it.

In the case where a document is accessed from a hyperlink in an origin
HTML document, a CHARSET attribute is added to the attribute list of
elements with link semantics (A and LINK), specifically by adding it to
the linkExtraAttributes entity.
The value of that attribute is to be considered a hint to the User
Agent as to the character encoding scheme used by the resource pointed
to by the hyperlink; it should be the appropriate value of the MIME
charset parameter for that resource.

In any document, it is possible to include an indication of the
encoding scheme like the following, as early as possible within the
HEAD of the document:

This is not foolproof, but will work if the encoding scheme is such
that ASCII characters stand for themselves at least until the META
element is parsed. Note that there are better ways for a server to
obtain character encoding information than the unreliable method
above; see [NICOL2] for some details and a proposal.

For definiteness, the "charset" parameter received from the source of
the document should be considered the most authoritative, followed in
order of preference by the contents of a META element such as the
above, and finally the CHARSET parameter of the anchor that was
followed (if any).

When HTML text is transmitted directly in UCS-2
(charset=UNICODE-1-1), the question of byte order arises: does the
high-order byte of each two-byte character come first or second? For
definiteness, this specification recommends that UCS-2 be transmitted
in big-endian byte order (high-order byte first), which corresponds to
the established network byte order for two-byte quantities, to the
Unicode recommendation for serialized text data and to RFC 1641.
Furthermore, to maximize chances of proper interpretation, it is
recommended that documents transmitted as UCS-2 always begin with a
ZERO-WIDTH NON-BREAKING SPACE character (hexadecimal FEFF) which, when
byte-reversed, becomes FFFE, a code value guaranteed never to be
assigned.
Thus, a user agent receiving FFFE as the first two octets of a text
would know that bytes have to be reversed for the remainder of the
text.

There exist so-called UCS Transformation Formats that can be used to
transmit UCS data, in addition to UCS-2. UTF-7 [RFC1642] and UTF-8
[UTF-8] have interesting properties (no byte-ordering problem,
different flavours of ASCII compatibility) that make them worthy of
consideration, especially for transmission of multilingual text. The
UTF-1 transformation format of ISO 10646 (registered by IANA as
ISO-10646-UTF-1) has been removed from ISO 10646, and should not be
used.

7. HTML Public Text

7.1. HTML DTD

This section contains a DTD for HTML based on the HTML 2.0 DTD of RFC
1866, incorporating the changes for file upload as specified in RFC
1867, and the changes deriving from this document.

[The DTD text is missing from this copy.]
7.2. SGML Declaration for HTML

[The SGML declaration text is missing from this copy.]

7.3. ISO Latin 1 entity set

The following public text lists each of the characters specified in
the Added Latin 1 entity set, along with its name, syntax for use, and
description. This list is derived from ISO Standard
8879:1986//ENTITIES Added Latin 1//EN. HTML includes the entire entity
set, and adds entities for all missing characters in the right part of
ISO-8859-1.

[The entity declarations are missing from this copy.]

Bibliography

[BRYAN88]    M. Bryan, "SGML -- An Author's Guide to the Standard
             Generalized Markup Language", Addison-Wesley, Reading,
             1988.

[ERCS]       Extended Reference Concrete Syntax for SGML.

[ETHNO]      "Ethnologue, Languages of the World", 12th Edition,
             Barbara F. Grimes, editor, Summer Institute of
             Linguistics, Dallas, 1992.

[GOLD90]     C. F. Goldfarb, "The SGML Handbook", Y. Rubinsky, Ed.,
             Oxford University Press, 1990.

[HTTP-1.0]   T. Berners-Lee, R.T. Fielding, and H. Frystyk Nielsen,
             "Hypertext Transfer Protocol -- HTTP/1.0", Work in
             progress (draft-ietf-http-v10-spec-04.txt), MIT/LCS,
             UC Irvine, October 1995.

[HTTP-1.1]   R.T. Fielding, H. Frystyk Nielsen, and T. Berners-Lee,
             "Hypertext Transfer Protocol -- HTTP/1.1", Work in
             progress (draft-ietf-http-v11-spec-01.txt), MIT/LCS,
             January 1996.

[ISO-639]    ISO 639:1988. Codes pour la représentation des noms de
             langue. Technical content in

[ISO-3166]   ISO 3166:1993. Codes pour la représentation des noms
             de pays.

[ISO-8601]   ISO 8601:1988. Éléments de données et formats
             d'échange -- Échange d'information -- Représentation
             de la date et de l'heure.

[ISO-8859-1] ISO 8859-1:1987. International Standard -- Information
             Processing -- 8-bit Single-Byte Coded Graphic
             Character Sets -- Part 1: Latin Alphabet No. 1.

[ISO-8879]   ISO 8879:1986. International Standard -- Information
             Processing -- Text and Office Systems -- Standard
             Generalized Markup Language (SGML).

[ISO-10646]  ISO/IEC 10646-1:1993. International Standard --
             Information technology -- Universal Multiple-Octet
             Coded Character Set (UCS) -- Part 1: Architecture and
             Basic Multilingual Plane.

[NICOL]      G.T. Nicol, "The Multilingual World Wide Web",
             Electronic Book Technologies, 1995,

[NICOL2]     G.T. Nicol, "MIME Header Supplemented File Type", Work
             in progress, , EBT, October 1995.

[RFC1468]    J. Murai, M. Crispin and E. van der Poel, "Japanese
             Character Encoding for Internet Messages", RFC 1468,
             Keio University, Panda Programming, June 1993.

[RFC1521]    N. Borenstein and N. Freed, "MIME (Multipurpose
             Internet Mail Extensions) Part One: Mechanisms for
             Specifying and Describing the Format of Internet
             Message Bodies", RFC 1521, Bellcore, Innosoft,
             September 1993.

[RFC1590]    J. Postel, "Media Type Registration Procedure", RFC
             1590, USC/ISI, March 1994.

[RFC1641]    D. Goldsmith, M. Davis, "Using Unicode with MIME", RFC
             1641, Taligent Inc., July 1994.

[RFC1642]    D. Goldsmith, M. Davis, "UTF-7: A Mail-safe
             Transformation Format of Unicode", RFC 1642, Taligent
             Inc., July 1994.

[RFC1738]    T. Berners-Lee, L. Masinter, and M. McCahill, "Uniform
             Resource Locators (URL)", RFC 1738, CERN, Xerox PARC,
             University of Minnesota, October 1994.

[RFC1766]    H. Alvestrand, "Tags for the Identification of
             Languages", RFC 1766, UNINETT, March 1995.

[RFC1866]    T. Berners-Lee and D. Connolly, "Hypertext Markup
             Language - 2.0", RFC 1866, MIT/W3C, November 1995.

[RFC1867]    E. Nebel and L. Masinter, "Form-based File Upload in
             HTML", RFC 1867, Xerox Corporation, November 1995.

[SQ91]       SoftQuad, "The SGML Primer", 3rd ed., SoftQuad Inc.,
             1991.

[TAKADA]     Toshihiro Takada, "Multilingual Information Exchange
             through the World-Wide Web", Computer Networks and
             ISDN Systems, Vol. 27, No. 2, Nov. 1994, p. 235-241.

[TEI]        TEI Guidelines for Electronic Text Encoding and
             Interchange.

[UNICODE]    The Unicode Consortium, "The Unicode Standard --
             Worldwide Character Encoding -- Version 1.0",
             Addison-Wesley, Volume 1, 1991, Volume 2, 1992. The
             BIDI algorithm is in appendix A of volume 1, with
             corrections in appendix D of volume 2.

[UTF-8]      X/Open Company Ltd., "File System Safe UCS
             Transformation Format (FSS_UTF)", X/Open Preliminary
             Specification, Document Number P316. Also published in
             Unicode Technical Report #4 and soon in an annex to
             ISO 10646-1.

[VANH90]     E. van Herwijnen, "Practical SGML", Kluwer Academic
             Publishers Group, Norwell and Dordrecht, 1990.

Authors' Addresses

François Yergeau
Alis Technologies
100, boul. Alexis-Nihon
Suite 600
Montréal QC H4M 2P2
Canada

Tel: +1 (514) 747-2547
Fax: +1 (514) 747-2561
EMail: fyergeau@alis.com

Gavin Thomas Nicol
Electronic Book Technologies, Japan
1-29-9 Tsurumaki,
Setagaya-ku,
Tokyo
Japan

Tel: +81-3-3230-8161
Fax: +81-3-3230-8163
EMail: gtn@ebt.com, gtn@twics.co.jp

Glenn Adams
Stonehand
118 Magazine Street
Cambridge, MA 02139
U.S.A.

Tel: +1 (617) 864-5524
Fax: +1 (617) 864-4965
EMail: glenn@stonehand.com

Martin J. Duerst
Multimedia-Laboratory
Department of Computer Science
University of Zurich
Winterthurerstrasse 190
CH-8057 Zurich
Switzerland

Tel: +41 1 257 43 16
Fax: +41 1 363 00 35
EMail: mduerst@ifi.unizh.ch