idnits 2.17.1

draft-benitez-winter-cultures-00.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Cannot find the required boilerplate sections (Copyright, IPR, etc.) in
     this document.
     Expected boilerplate is as follows today (2024-04-25) according to
     https://trustee.ietf.org/license-info :

     IETF Trust Legal Provisions of 28-dec-2009, Section 6.a:

        This Internet-Draft is submitted in full conformance with the
        provisions of BCP 78 and BCP 79.

     IETF Trust Legal Provisions of 28-dec-2009, Section 6.b(i), paragraph 2:

        Copyright (c) 2024 IETF Trust and the persons identified as the
        document authors. All rights reserved.

     IETF Trust Legal Provisions of 28-dec-2009, Section 6.b(i), paragraph 3:

        This document is subject to BCP 78 and the IETF Trust's Legal
        Provisions Relating to IETF Documents
        (https://trustee.ietf.org/license-info) in effect on the date of
        publication of this document. Please review these documents
        carefully, as they describe your rights and restrictions with respect
        to this document. Code Components extracted from this document must
        include Simplified BSD License text as described in Section 4.e of
        the Trust Legal Provisions and are provided without warranty as
        described in the Simplified BSD License.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  ** Missing expiration date. The document expiration date should appear on
     the first and last page.

  ** The document seems to lack a 1id_guidelines paragraph about
     Internet-Drafts being working documents.

  ** The document seems to lack a 1id_guidelines paragraph about the list of
     current Internet-Drafts.

  ** The document seems to lack a 1id_guidelines paragraph about the list of
     Shadow Directories.

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack a Security Considerations section.

  ** The document seems to lack an IANA Considerations section. (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  ** There are 3 instances of too long lines in the document, the longest one
     being 31 characters in excess of 72.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The "Author's Address" (or "Authors' Addresses") section title is
     misspelled.

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008. If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment. If not, you may need to add the pre-RFC5378 disclaimer.
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- Couldn't find a document date in the document -- date freshness check
     skipped.

  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Missing reference section? 'UNICODE' on line 1043 looks like a reference

  -- Missing reference section? 'I-HTML' on line 1031 looks like a reference

  -- Missing reference section? 'BRIAN' on line 1000 looks like a reference

  -- Missing reference section? 'CARRASCO-1' on line 1004 looks like a reference

  -- Missing reference section? 'CARRASCO-2' on line 1008 looks like a reference

  -- Missing reference section?
     'CARRASCO-3' on line 1012 looks like a reference

  -- Missing reference section? 'CONNOLLY' on line 1017 looks like a reference

  -- Missing reference section? 'ISO-8859-1' on line 1036 looks like a reference

  -- Missing reference section? 'NICOL' on line 1040 looks like a reference

  -- Missing reference section? 'ZACK' on line 1047 looks like a reference

     Summary: 8 errors (**), 0 flaws (~~), 2 warnings (==), 12 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

1 INTERNET-DRAFT                                    M.T. Carrasco Benitez
2
3 Expires November 16th 1996                               May 16th, 1996

5 WInter
6 (Web Internationalization & Multilinguism)

8 Status of this Memo

10 This document is an Internet-Draft. Internet-Drafts are working
11 documents of the Internet Engineering Task Force (IETF), its areas,
12 and its working groups. Note that other groups may also distribute
13 working documents as Internet-Drafts.

15 Internet-Drafts are draft documents valid for a maximum of six
16 months and may be updated, replaced, or obsoleted by other documents
17 at any time. It is inappropriate to use Internet-Drafts as reference
18 material or to cite them other than as "work in progress".

20 To learn the current status of any Internet-Draft, please check
21 the "1id-abstracts.txt" listing contained in the Internet-Drafts
22 Shadow Directories on ftp.is.co.za (Africa), nic.nordu.net (Europe),
23 munnari.oz.au (Pacific Rim), ds.internic.net (US East Coast), or
24 ftp.isi.edu (US West Coast).

26 Distribution of this document is unlimited. Please send comments
27 to the WInter mailing list at . Information
28 about the WInter mailing list, including subscription details, is
29 on the WInter Page at: http://www.crpht.lu/~carrasco/winter

31 Abstract

33 This document discusses the Internationalization & Multilinguism
34 of the Web.
The goal is a Web capable of supporting different cultures, natural
35 languages and Language Engineering facilities such as Parallel
36 Texts. Internationalization permeates most subsystems: client,
37 transmission, server, data and authoring; the primitive mechanisms
38 for WIntering should be part of the Web foundations.

40 Table of Contents

42 1. Introduction
43   1.1 Mandate
44   1.2 Writing style
45   1.3 Terminology

47 2. Character Set
48   2.1 Back office
49   2.2 Front office
50   2.3 Multilingual typography
51   2.4 The characters in the URL

53 3. Internationalization & localization
54   3.1 Elements of localization
55   3.2 Messages as HTML pages

57 4. Multilinguism

59 5. Parallel Hypertext
60   5.1 Definition
61   5.2 Language tags
62   5.3 Document request
63   5.4 Parallel Hypertext Data Structure (PHDS)
64   5.5 Linking strategy
65   5.6 Generation of parallel texts
66     5.6.1 Language dependent strings
67     5.6.2 Language-void document

69 6. Bidirectionality (BIDI)
70 7. The LANG attribute
71 8. LINKs
72 9. Multilingual thesaurus
73 10. Electronic Data Interchange (EDI)
74 11. Passing selected text to a CGI
75 12. Reference model for Internationalization & Multilinguism
76 13. VRML
77 14. Java

79 15. Dragoman
80   15.1 Interactive Search
81   15.2 The Translation Folder (full preprocessing)
82   15.3 Preprocessing for Machine Translation
83   15.4 Machine Translation
84   15.5 Pseudo-Automatic Translation (PAT)
85   15.6 Document Generation
86   15.7 Document Comparison
87   15.8 Author's Workbench
88   15.9 Terminology Verification
89   15.10 Multilingual Aligned Text Editor
90   15.11 Printing

92 16. Acknowledgments
93 17. Bibliography
94 18. Author Address

96 1. Introduction

98 The intention of this document is to consider all aspects of
99 WIntering. It aims to fulfill two functions:

101 - A catalogue of issues

103 - A primer

105 To a very large extent, it puts together the efforts of other
106 groups. It goes into more detail when material is not covered
107 elsewhere.

109 An Internationalized & Multilingual Web should have the traditional
110 facilities of Internationalization and the more advanced facilities
111 needed for Language Engineering. For example, clients should have
112 a language menu (similar to edit or file menus) that shows in which
113 other linguistic versions the currently displayed document is
114 available; or clients should be capable of displaying two linguistic
115 versions of the same document side by side and moving through them in sync.

117 "Another noteworthy characteristic of this manual is that it doesn't
118 always tell the truth. When certain concepts of TEX are introduced
119 informally, general rules will be stated; afterwards you will find
120 that the rules aren't strictly true."

122 The TEXbook, Donald E. Knuth

124 The above quote particularly applies to the documents summarized in
125 this document. Though the intention is to make this document
126 self-contained by summarizing or quoting other documents, it is strongly
127 recommended to consult the source documents.

129 1.1 Mandate
130 One of the recommendations of the Internationalization Workshop
131 during the Fifth International WWW Conference in Paris on May 6th
132 1996 was that a document should be maintained to fulfill the
133 purpose described in the above introduction. The author accepted
134 the task and the present document is the result.

136 1.2 Writing style
137 A special effort should be made to make this document as accessible
138 as possible to non-computer specialists (e.g., linguists) and
139 non-native speakers of English. Due to the characteristics of WInter,
140 there should be a significant number of both. This does not imply
141 that there should be one type of document for each type of participant.
142 It means that this document should be accessible to all participants,
143 perhaps by adopting a journalistic style and restating the obvious.
144 The overhead should be small and it is good to avoid misunderstanding,
145 even between people of the same field.

147 Comments regarding the writing style from journalists or readers
148 with similar profiles are very welcome; i.e., non-computer specialists
149 who have to explain computer material to other non-computer
150 specialists. Suggestions could cover what additional
151 material should be included to make this document more self-contained,
152 and what terms should be replaced to make it more accessible. But
153 the gory normative details must be present.

155 1.3 Terminology
156 Alignedness
157 It is a quality of Parallel Texts; for example, the Treaty of Rome
158 in English and Spanish are Parallel Texts and they should be aligned.
159 The interesting part is aligning Parallel Texts automatically.

161 Author-Translator-Publisher Chain (ATP-chain)
162 It refers to the integration of all the phases in the production
163 of documents. These are usually large distributed systems.

165 Globalization
166 In the context of electronic commerce, the mechanisms to facilitate
167 global trade. Internationalization & Multilinguism are some of
168 these mechanisms. A legal framework is an example of a non-computer
169 mechanism.

171 I18N
172 Abbreviation for Internationalization. The 18 refers to the 18 characters
173 between the initial I and the final N: nternationalizatio.

175 Language Engineering
176 Language Engineering is the application of computer science to
177 natural languages. For example:

179 - Terminology

181 - Translator's Memory

183 - Multilingual documentary databases

185 - Aligned Text
186 - Translator's Workbench

188 - Author's Workbench

190 - Machine Translation

192 - Publishing (in particular, multilingual synchronized publishing)

194 Level of Alignedness
195 This is a metric of alignedness. Depending on the depth to which
196 the Linguistic Objects can be identified, the texts are aligned
197 at:

199 - Document level: the trivial case; i.e., Parallel Texts.

201 - Paragraph level: not too hard to achieve.

203 - Sentence level: desirable and possible to achieve.

205 - Term level: it needs tagging for automatic alignment.

207 - Word level: it needs tagging for automatic alignment.

209 In this context, a sentence is a part of a text delimited by a full stop,
210 semicolon or similar; i.e., it has little grammatical meaning and
211 the main interest is to identify Linguistic Objects.

213 Linguistic Object
214 A Linguistic Object is a unit of language representation. It can be
215 a fixed language representation (term, abbreviation, title, segment,
216 phrase, paragraph, etc.) or a meta-language representation (a grammatical
217 construction, etc.). More generally, a Linguistic Object is a discrete
218 linguistic unit (usually a string) whose meaning is created by the
219 program treating it.

221 Multilingual Aligned Text (MAT)
222 A MAT is a record in a table with one Linguistic Object per language
223 field (English, Spanish, German, etc.) that are equivalents
224 (usually translations) of each other. There are other fields
225 for classification and other purposes. MATs constitute independent
226 elements of a table; i.e., there is no ordering in the table. The
227 end result is a data structure similar to a multilingual dictionary.

229 Parallel Texts
230 Texts that are translations of each other. For example, the Treaty
231 of Rome in English and Spanish are Parallel Texts. Parallel Texts
232 can be aligned at several levels.

234 WInter
235 It stands for Web Internationalization & Multilinguism.

237 2. Character Set

239 A large character set is a basic prerequisite for
240 Internationalization & Multilinguism. The bottom line is that the
241 Web must be capable of handling Unicode [UNICODE].

243 The character set should be considered a low level layer; i.e.,
244 like the wires in the seven-layer ISO Reference Model
245 (physical, datalink, network, etc.). Other functionalities should
246 be in other layers.
There is a tendency to overload this layer,
247 as opposed to defining new layers.

249 There are two aspects to the character set:

251 The Back office
252 It deals with storage on disk, transmission, representation in the
253 document, etc.

255 The Front office
256 It is concerned with rendering on the screen or printer.

258 2.1 Back office
259 Latin-1 [ISO-8859-1] is the default character set for the Web.
260 Latin-1 is only sufficient for Western European languages. Latin-1
261 is an 8-bit encoding. This permits a maximum of 256 characters.

263 Unicode (ISO 10646 BMP) is a large character set that includes most
264 of the world's languages. Unicode is a 16-bit encoding. This permits
265 over 65,000 characters. At present, over 25,000 positions are still
266 free. This form is also called UCS-2; i.e., Universal Character
267 Set 2-bytes. Unicode is the first plane of ISO 10646 (see below);
268 this plane is also called BMP (Basic Multilingual Plane) or Plane
269 Zero. The Internationalization of the Hypertext Markup Language
270 [I-HTML] proposes Unicode as the document character set.

272 ISO 10646 is a 32-bit encoding. It is divided into about 32,000 planes
273 (32,768), each with a capacity of about 65,000 characters (65,536). This
274 permits about 2,147 million characters. This form is also called UCS-4,
275 Universal Character Set 4-bytes. Only the first plane (Unicode) is in use.

277 UTF-8 (Universal Character Set Transformation Format) is an addendum
278 to ISO 10646. It provides compatibility with ASCII: the ASCII
279 characters are represented by 1 byte (8 bits) rather than 4 bytes (32
280 bits). In general, it is economical with the bytes used in the
281 encoding.

283 [HTTP-1.1] allows for the character set to be negotiated. For
284 example, the client and server can agree on using Unicode.

286 2.2 Front office
287 Rendering is drawing the glyphs (graphic representation of the
288 characters) on the screen or printer.
This is the job of the browser,
289 and the browser depends on the graphical facilities of the computer.

291 Undisplayable characters are the characters that cannot be displayed
292 due to the lack of facilities. I-HTML "does not prescribe any
293 specific behavior", but notes some "considerations". WInter recommends
294 the following:

296 - The behavior for undisplayable characters must be controlled by
297 the option settings of the browser.

299 - Some options can be combined.

301 - There must be a small Undisplayable Characters Flag in the
302 browser part of the screen, not in the document part. Something
303 similar to the red button indicating that the browser is loading
304 a document, but smaller. The flag must be ON if the current document
305 contains one or more undisplayable characters. The presence or
306 absence of the flag must be user definable.

308 - Undisplayable Character Tolerance is a user definable value in
309 the range from 0 to 10 that signals the behavior of the browser.

311 - 0 Undisplayable Character Tolerance means ignore all undisplayable
312 characters.

314 - 5 Undisplayable Character Tolerance means a reasonable default
315 warning for undisplayable characters. This behavior must be defined.
316 For example, show only up to 10 consecutive undisplayable characters
317 and try remaps, such as "e'" to "e".

319 - 10 Undisplayable Character Tolerance means show one Replacement
320 Glyph for each undisplayable character.

322 - The other intermediary values must change gradually.

324 - Undefined Undisplayable Character Tolerance must gravitate
325 towards the default value (5).

327 - The undisplayable characters must be remappable to a user definable
328 Replacement Glyph (for example, "_") or to one of several numeric
329 representations; for example, hexadecimal or decimal.

331 - The default Replacement Glyph must occupy approximately the same
332 space as the average glyph in the document.
It must be a box
333 containing the Unicode value in hex.

335 Font Servers could supply the browser with missing glyphs.

337 2.3 Multilingual typography
338 {The proposal of Martin Dürst will be summarized here.}

340 2.4 The characters in the URL
341 The characters allowed in the URL are a subset of ASCII. URLs were
342 supposed to be hidden, but they are very visible and important
343 commercially: firms want to spell their names with accents. The
344 most urgent need is a large character set for the query part.
345 There have been proposals to use UTF-8. URLs need a lot of
346 work.

348 3. Internationalization & localization

350 Internationalized software is developed without the cultural
351 characteristics embedded. It can be localized parametrically for
352 different cultures; for example, the same software can run for
353 Germany with the German conventions, or for Italy with the Italian
354 conventions.

356 Internationalization is a well known field; for example, a significant
357 amount of effort was made during the POSIX (Unix) standardization.
358 The mechanisms must be sufficient for implementing the localizations.
359 Localization itself is usually discussed in other fora; for example,
360 how to represent the date in Germany. Most conventions have
361 already been agreed.

363 Any number of cultures (real or imaginary) are possible. For example,
364 France, Germany, European Commission. In the case of the European
365 Commission, it has to work in the eleven official languages (including
366 Greek), and with cross-cultural conventions or with the national
367 conventions.

369 3.1 Elements of localization
370 Languages
371 Two aspects:

373 - Language strings in the software.

375 - Data in the document.

377 For example, the software could be in German and the document shown in
378 French.

380 Sorting order
381 Number representation
382 For example, the internal number could be 12345.67 and the external
383 representation could be 12,345.67 or 12.345,67.

385 Date & Time
386 For example, the internal representation could be 19951231 and the
387 external representation could be December 31st 1995, or 31-12-1995.

389 Short quotations
390 For example,

392 - "I am a Berliner" (English)

394 - <> (French)

396 - ,,Ich bin ein Berliner'' (German)

398 The new <Q> element in I-HTML is for this purpose.

400 New internationalization elements should be added to this list,
401 for example, color.

403 The software should be localized from a list of preferred localizations,
404 and switchable from one localization to another without restarting
405 the application.

407 3.2 Messages as HTML pages
408 The Status-Code and the Reason-Phrase (see 6.1.1, HTTP-1.1) are
409 presented as HTML pages. These are Language strings in the software
410 but are usually presented as data documents. For example, 404: Not
411 Found.

413 The localization of the Reason-Phrase can be done by the client or
414 the server. If the client can do a better job, it has to drop the
415 page sent by the server and generate the localized page from the
416 Status-Code and the LANG tag.

418 4. Multilinguism

420 Multilinguism deals with advanced language facilities, often several
421 languages simultaneously. It is also referred to as Language Engineering.
422 This comes from the tradition of specialized software for Language
423 Engineering, such as the Translator's Workbench. One of the main
424 applications is the processing of Parallel Texts.

426 Most software in Language Engineering is incompatible and
427 there are practically no standards in this field.
Usually, researchers
428 or vendors start from scratch and develop all the modules, even
429 horizontal modules such as user interfaces and data structures,
430 rather than concentrating on the engines for language processing (for
431 aiding the translator, machine translation, etc.).

433 One of the main immediate objectives in Language Engineering must
434 be the creation of standards that clearly separate data and software;
435 i.e., it should be possible to acquire a translation aid program
436 from one vendor and the dictionaries from another vendor.

438 The purpose is not to make every browser a Translator's Workbench,
439 though browsers could do with more advanced language facilities
440 than are usually found in internationalized products. But the
441 standards must allow the construction of Translator's Workbenches
442 based on Web technology.

444 After security and the application for secure payment over the
445 Internet, Language Engineering is one of the applications most
446 relevant from an economic point of view; in intranets, with fewer
447 security requirements, it is probably the most important. It is as
448 horizontal as publishing and, indeed, it is the second phase in
449 the ATP-chain (Author-Translator-Publisher). Translating is expensive
450 and very human intensive. For most texts, machine translation is
451 not acceptable. On the other hand, translation aid tools are
452 very cost effective, particularly if integrated in an ATP-chain.
453 Savings in translation tend to be large.

455 5. Parallel Hypertext

457 5.1 Definition
458 Parallel Hypertext is an extension of the hypertext paradigm to
459 natural languages. For example, a user looking at a document in
460 English should be able to obtain the Spanish version in a transparent
461 way; i.e., just by selecting the Spanish option in a language menu
462 and not by selecting a link embedded in the English version.
For
463 this, the Web must know about languages; i.e., how to find the same
464 document in another language. The same property of alignedness in
465 Parallel Texts applies to Parallel Hypertext.

467 5.2 Language tags
468 The language tags (see 3.10, HTTP-1.1) are composed of a primary
469 language tag and zero or more subtags.

471 Examples:

473 en
474 en-US
475 en-cockney

477 There must be a way to indicate

479 - Human translation

481 - Machine translation

483 - Transliteration

485 This could be part of a subtag or inside the document.
486 {Examples will be added.}

488 5.3 Document request
489 Clients should be able to request documents at least in the following
490 ways:

492 - A document is requested according to a preference language list
493 that could be the same list used for choosing the display labels
494 in the user interface. The server must respond with the best linguistic
495 version and the list of available linguistic versions. The best
496 linguistic version means the one nearest to the top of the list and, if
497 none is available, the one nearest to the top of the defaults in the
498 server. In this case, the browser probably does not know which
499 linguistic versions are available.
500 {This will be developed.}

502 - A document is requested in one specific language. The server
503 must respond only with that linguistic version (no other is
504 acceptable) and the list of available linguistic versions. In this
505 case, the client probably knows that the requested version is
506 available; it could be the result of a previous conversation with
507 the server.
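
The two request modes described in 5.3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name select_version, its parameters and the strict flag are hypothetical and come neither from HTTP-1.1 nor from this document, and language tags are compared as exact strings (no subtag matching).

```python
from typing import List, Optional, Sequence, Tuple


def select_version(
    preferred: Sequence[str],        # client's language preference list, best first
    available: Sequence[str],        # linguistic versions the server holds
    server_defaults: Sequence[str],  # server's own default list, best first
    strict: bool = False,            # True: request for one specific language only
) -> Tuple[Optional[str], List[str]]:
    """Return (chosen language tag or None, list of available versions).

    Hypothetical helper: it always returns the list of available
    versions so that a client could refresh its language menu.
    """
    versions = list(available)

    # Take the available version nearest to the top of the client's list.
    for tag in preferred:
        if tag in versions:
            return tag, versions

    if strict:
        # Specific-language request: no other version is acceptable.
        return None, versions

    # Otherwise fall back to the version nearest to the top of the
    # server's defaults.
    for tag in server_defaults:
        if tag in versions:
            return tag, versions

    return None, versions


# A preference-list request and a single-language request.
available = ["de", "it", "es"]
print(select_version(["da", "en", "de"], available, ["es", "it"]))
print(select_version(["es"], available, ["de"], strict=True))
```

Returning the list of available versions with every answer, even when no version is chosen, mirrors the requirement that the server always reports which linguistic versions exist.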

509 Example:

511 - Conversation 1
512 Client : Give me MyDoc with this order of preference: Danish,
513 English or German
514 Server : Take MyDoc in German; it is available in German, Italian and Spanish

516 - Conversation 2
517 Client : Give me MyDoc only in Spanish
518 Server : Take MyDoc in Spanish; it is available in German, Italian
519 and Spanish

521 The linguistic versions of the document could be on different servers.

523 This could be done with the Accept-Language and Content-Language
524 facilities (see 10.4 and 10.11, HTTP-1.1).

526 The parameters in Accept-Language:

528 Quality factor "q" is described as "... estimate of the user's
529 comprehension of that language ...". But the user indicates a
530 language preference list and there is no need to use the parameter
531 with this meaning. It would be more useful to indicate the "minimum
532 acceptable quality of the translation". Some of the translations
533 could be done by more or less experienced translators, or by machine
534 translation.

536 A different usage could be to indicate the level of alignedness.

538 Maximum acceptable size "mxb" is not used. It could indicate the
539 number of linguistic versions desired.

541 An Accept-Language with a single language parameter must mean that
542 the browser only wants that linguistic version and not another.

544 The Content-Language "... describes the natural language(s) of the
545 intended audience ...". The meaning of this field should be "the
546 list of linguistic versions available"; it should be used by the
547 browser to update the language menu, so the user could know which
548 other linguistic versions are available.

550 5.4 Parallel Hypertext Data Structure (PHDS)
551 One Parallel Hypertext Data Structure contains all the information
552 for one Parallel Hypertext Document. The Parallel Hypertext Data
553 Structure must allow the following:

555 - Several data schemes.
For example, directory, SGML, tar, etc.

557 - Keeping the linguistic versions on different servers

559 - Conversation with monolingual clients. In this case, the user
560 must know the structure

562 The Parallel Hypertext Data Structure has two parts:

564 The PHDS-Header
565 Contains administrative data. For example, where the German
566 linguistic version is. The data is divided into structured fields.

568 The PHDS-Body
569 Contains the linguistic data. It has one section per language.

571 The PHDS-Header is always an HTML file. This file must fulfill two
572 functions:

574 - Allowing a user to select one linguistic version

576 - Being used by WIntered Web programs (clients/servers) as a
577 data structure to locate the pertinent linguistic version

579 The PHDS-Header must contain at least the following information:

581 - Name

583 - DataScheme

585 - DataLocation (for all the parts)

587 The DataScheme applies only to the PHDS-Body. The PHDS-Header is
588 always an HTML file.

590 {An example of a file in HTML will be added.}

592 The default for a single set of files is:

594 DocName.html               (PHDS-Header)

596 DocNameDir                 (PHDS-Body, a directory)
597   /en.html                 English (PHDS-Body language section)
598   /es.html                 Spanish (PHDS-Body language section)
599   /de.html                 German (PHDS-Body language section)

601 The default for several sets of files is:

603 DocName.html               (PHDS-Header)

605 DocNameDir                 (PHDS-Body, a directory)
606   /en/DocName1.html        English (PHDS-Body language section)
607   /en/DocName2.html        English (PHDS-Body language section)

609   /es/DocName1.html        Spanish (PHDS-Body language section)
610   /es/DocName2.html        Spanish (PHDS-Body language section)

612   /de/DocName1.html        German (PHDS-Body language section)
613   /de/DocName2.html        German (PHDS-Body language section)

615 The DocName.html should be usable directly by the present clients
616 (browsers) and/or indirectly to generate HTML files on the fly.
617 Multilingual clients should use the information to access the
618 documents in a transparent way.

620 Requesting a URL of a PHDS-Header must get the linguistic version
621 according to the rules of the language preferences. Requesting a
622 URL of a PHDS-Body language section must get that linguistic version.

624 The server must know at least the following defaults:

626 - language with the explicit links

628 - preferred language list

630 - MAT table

632 {This will be extended.}

634 A standard data structure for Parallel Hypertext would be of use
635 for anybody working with Parallel Texts, whether or not the Web
636 is used. For example, CD-ROMs could be published with Parallel
637 Texts for language processing programs, such as Machine Translation,
638 that would know what to expect. At present, there is no standard
639 for Parallel Texts or MAT.

641 The relation with the Text Encoding Initiative (TEI) will be explored.

643 5.5 Linking strategy
644 The linking strategy must minimize maintenance. This is essential
645 for large multilingual documentary databases. For example, the
646 millions of pages of the European Institutions in eleven languages.
647 Only one linguistic version should have explicit links; i.e., the
648 links as used today that are physically present in the documents.
649 The other linguistic versions would have implicit links; i.e., links
650 that would not be physically present in the texts, but that could
651 be calculated from the alignedness of the different linguistic
652 versions.

654 The generation of implicit links could be a client, server and/or
655 authoring affair:

657 - Client.- A client could receive a linguistic version with explicit
658 links and a linguistic version with implicit links. The client
659 would display the linguistic version with the explicit links or it
660 would calculate the implicit links on the fly and display the
661 result.

663 - Server.- A multilingual server could process documents with
664 implicit links and generate documents with explicit links on the fly.

666 - Authoring.- An interactive or batch authoring system could
667 process documents with implicit links and it could create new
668 documents with explicit links; the server would not know how the
669 new documents were created.

671 These options should be considered as a continuum and (some) are
672 not mutually exclusive: most degrees between the extremes are
673 possible. For example, servers could be able to create documents
674 on the fly and they could be using documents with the links generated
675 by authoring systems. Indeed, a mixture could be the most probable
676 case.

678 The level of alignedness should be calculated in advance and kept
679 in the Parallel Hypertext Data Structure. Some documents were widely
680 regarded as aligned because they had been revised over half a dozen
681 times and heavily used for decades (best-case
682 documents); once they were submitted to a computer program, it came to
683 light that they were not aligned even at paragraph level.

685 The linked text (i.e., what goes between <A> and </A>) would
686 have to be at least at the level to which the texts are aligned.
687 For example, for texts aligned only at paragraph level, it is not
688 possible to calculate implicit links at sentence level. A corollary
689 is that texts aligned at document level can have implicit links
690 only at the beginning or at the end.

692 The links would have to be at least at sentence level. It would be
693 hard to place implicit links in part of a sentence without tagging:
694 the second text should have null links; named null links if there
695 are several in one sentence.

697 Examples:

699 - No need for null links in the second text. A whole sentence is
700 linked in the first text and finding the place for the implicit
701 links in the second text is easy.

703 The white table. The black table The green table.
704 La mesa blanca. La mesa negra. La mesa verde. 705 (implicit link) 706 - It needs a null link in the second text. Only part of a sentence 707 is linked in the first text and finding the place for the implicit 708 link in the second text is hard; i.e., it cannot be done with simple 709 string processing and it needs computational linguistics. 711 The white table. The black table The green table. 712 La mesa blanca. La mesa negra. La mesa verde. 713 (null link) 715 5.6 Generation of parallel texts 716 The linguistic versions could be generated through machine translation 717 or other techniques. For example, a system could have documents in 718 Spanish and a program for translation to English. The user should 719 be informed by the language menu into which languages and with 720 which techniques (MT, human translator, etc.) the documents are 721 available. 723 {This will be extended.} 725 5.6.1 Language dependent strings 726 These are tags to be replaced by a language string (Linguistic Object) 727 according to the language requested. For example, the following 728 shows the content of an HTML document and the resulting replacement, 729 assuming that the language requested is German and that the Linguistic 730 Object corresponding to the identifier String_1 is the German phrase 731 below: 733 735 Ich bin ein Berliner 737 5.6.2 Language-void document 738 A document without any language string; i.e., it contains only 739 language dependent strings. In this case, only one HTML document 740 is needed and not one per language; this HTML document could be 741 considered a mask. A database with Linguistic Objects is needed. 742 The same Linguistic Object can be used in several documents. 744 This technique could be used for the localization of the messages sent by the server as HTML documents. 746 6. Bidirectionality (BIDI) 748 (see 4.2, I-HTML) 749 {A summary from the I-HTML will be inserted.} 750 7.
The LANG attribute 752 (see 3, I-HTML) 753 {A summary from the I-HTML will be inserted.} 755 8. LINKs 757 758 759 760 {This will be extended.} 762 9. Multilingual thesaurus 764 This is a tool for finding references to the search in any language. 765 For example, if the string in the search is "table" it should also 766 find the Spanish document with the word "mesa" (table in Spanish). 768 10. Electronic Data Interchange (EDI) 770 Many EDI messages are printed. As the EDI messages are very 771 structured, a translation of the message could be shown using 772 Pseudo-Automatic Translation (PAT). 774 11. Passing selected text to a CGI 776 To consult terminological databases easily, it should be possible 777 to pass a selected string (with the mouse or other) to CGI programs 778 or similar. This is a generic mechanism. 780 12. Reference model for Internationalization & Multilinguism 782 This is a very first trial and further work is needed. The model 783 is layered, similar to the seven-layer ISO Reference Model (physical, 784 datalink, network, etc.). A different approach could be needed; for 785 example, a vector approach. 787 LayerNumber LayerName Example 789 1 compression gzip 790 2 transformation UTF-8 791 3 character set Unicode (65, "LATIN CAPITAL LETTER A") 792 4 glyph "A" 793 5 font Time 795 Other items to put into the model: 797 - sorting order 799 - language (e.g., Korean) 801 There is a general tendency to overload the character set layer. 802 For example, wishing to allocate two code positions to the same 803 ideogram because it means different things in different languages. 805 13. VRML 807 How do objects negotiate when they speak different languages? 808 {This will be developed.} 810 14. Java 812 {This will be developed.} 814 15. Dragoman 816 This section is included mostly to illustrate the kind of applications 817 for multilinguism. 819 Dragoman is a reference model for Language Engineering. It uses 820 the Multilingual Aligned Hypertext technique.
In essence, Dragoman 821 describes a Database (part structured and part documental) and 822 Services that can be implemented over the (multilingual) Database. 823 Often, different data structures are used for the Services described 824 below. 826 The Web paradigm is particularly well adapted to Dragoman. The term 827 Dragoman has nothing to do with dragons; it means language interpreter. 829 What follows is a very brief description of some of the Services 830 that could be implemented over the Database. There could be several 831 programs offering the same Service. Services processing whole 832 documents could be implemented in batch; particularly if they are 833 using a very large Database (several gigabytes). 835 15.1 Interactive Search 836 Selects the Multilingual Aligned Texts (MAT) that match the search 837 criteria. The search is fuzzy (e.g., 87% match). Unfound requests 838 are valuable information that must be processed further. The system 839 must keep track of the unfound requests to put in contact people 840 with similar needs (matchmaker); the user must decide what is a 841 typing error and what is a genuine unfound request. Also, the user 842 can send messages to terminologists (demand-driven terminology). 844 15.2 The Translation Folder (full preprocessing) 845 The objective is to obtain a complete Translation Folder for a 846 given document. Hence, the translator should not need to consult 847 dictionaries, databases, glossaries, nomenclature lists, etc. It is 848 like having a hundred assistants preparing the text for the 849 translator. In a typical Translation Folder, some paragraphs should 850 be fully translated and some paragraphs should be a mixture of full 851 sentences, segments, titles, terms, nomenclatures, etc. (all these 852 items are packaged as Linguistic Objects); background documents 853 could also be taken into account. The Linguistic Objects are marked 854 with the Status; for example, unverified, verified, compulsory, 855 etc.
The search follows a fuzzy biggest chunk heuristic. Traditionally 856 there are two texts, source and target. But there could be any 857 number of language fields. This could be the most useful Service 858 for the translator and it should be implemented early. The translator 859 could use the result on paper or on the screen. 861 15.3 Preprocessing for Machine Translation 862 Similar to the Translation Folder. It should be adapted to an 863 (existing) machine translation program that follows up the processing. 864 For example, select only exact matches (no fuzzy) and terms in the 865 unfound phrases; the machine translation program would translate 866 only the unfound phrases. 868 15.4 Machine Translation 869 A Machine Translation program that uses the Database directly. For 870 example, a program could combine perfect matches, process the easy 871 fuzzy matches such as dates, pure Machine Translation, etc. 873 15.5 Pseudo-Automatic Translation (PAT) 874 Similar to the Translation Folder, but where all the texts are 875 found with a 100% match (no fuzzy search). The program should be 876 restricted to a collection of records; i.e., it should not be 877 allowed to roam the Database as there could be bad surprises. In 878 particular, one must avoid word by word translation; hence one must 879 be very careful with small Multilingual Aligned Texts (for example, 880 a one-word Multilingual Aligned Text). 882 15.6 Document Generation 883 All the linguistic versions of a document are generated camera 884 ready. There is no source and translation as such, the index is 885 created, the typesetting (nearly) done. This is the most useful 886 Service for the Organization. It is a very efficient way to produce 887 documents. The three phases Author-Translator-Publisher (ATP-chain) 888 are highly integrated. It is particularly adapted to periodic 889 publications. The production of standardized documents is trivial. 
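As a rough illustration of generating all the linguistic versions from a single source, the sketch below fills a language-void mask (see 5.6.2) from a small table of Linguistic Objects. It is only a sketch under stated assumptions: the "$identifier" placeholder syntax, the identifiers "white" and "black", the in-memory table and the function name generate_versions are all hypothetical and are not defined by this document.

```python
from string import Template

# Hypothetical Linguistic Objects table: identifier -> {language: string}.
# A real system would read these from the Database.
LINGUISTIC_OBJECTS = {
    "white": {"en": "The white table.", "es": "La mesa blanca."},
    "black": {"en": "The black table.", "es": "La mesa negra."},
}

# Language-void mask: it contains only language dependent strings.
# The "$identifier" syntax is an assumption made for this sketch.
MASK = Template("<P>$white</P><P>$black</P>")


def generate_versions(mask, objects, languages):
    """Return {language: document} with every placeholder replaced."""
    versions = {}
    for lang in languages:
        # Pick the string of each Linguistic Object for this language.
        strings = {ident: by_lang[lang] for ident, by_lang in objects.items()}
        versions[lang] = mask.substitute(strings)
    return versions


versions = generate_versions(MASK, LINGUISTIC_OBJECTS, ["en", "es"])
for lang, document in versions.items():
    print(lang, document)
```

Because every version comes from the same mask, all linguistic versions share the same structure. In this sketch a Linguistic Object that lacks a requested language raises KeyError, so the gap is noticed rather than silently skipped.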
891 Documents in several linguistic versions are often required to be 892 synchronized; i.e., each page in each linguistic version must 893 contain the same content and the same lay-out (text, number of 894 paragraphs, etc.). The typesetting, including the synchronization, 895 must be automated and individual pages should not be processed by a human; 896 a human operator should intervene only to fine-tune the publication. 897 TeX should be considered. 899 A document might need several representations; for example, typeset 900 for the Official Journal, formatted for a CD-ROM or marked up in HTML 901 (for CD-ROM or server). First, a document in SGML should be generated; 902 indeed, the SGML document is the document. All the other 903 representations should be created from the SGML document. This 904 method should guarantee that all the representations have the same 905 content. 907 With such a system in place, the creation of secondary products is 908 easy. For example, a Parliamentary Commission could work with a 909 draft of the Budget typeset like the Official Journal, in all 910 the linguistic versions, enriched with hidden comments. 912 15.7 Document Comparison 913 The user directs the program to a document similar to the one that 914 has to be translated. The new pieces could be fetched from the 915 Database. This program could work without the Database, though the 916 new pieces would not be fetched. Similar translations could arise 917 as a version of a previous document and as a new similar document. 919 15.8 Author's Workbench 920 Authors could use a technique similar to the Translation Folder and 921 Document Comparison. The unknown parts of the text would be marked 922 and in certain cases alternatives would be proposed. Texts created 923 with the translation phase in mind are easier to translate. Ideally, 924 the author should aim to produce a text for translation with 925 Pseudo-Automatic Translation.
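The marking of unknown parts described in 15.8 could be sketched as follows. This is a minimal illustration and not a definitive implementation: it assumes that the known segments are whole sentences held in a plain set, that matching is exact apart from letter case (as in Pseudo-Automatic Translation), and that a naive punctuation-based splitter is good enough. The function names and the "[? ... ?]" marker are hypothetical.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def mark_unknown(text, known_segments):
    """Return (sentence, is_known) pairs; matching ignores letter case."""
    known = {segment.casefold() for segment in known_segments}
    return [(s, s.casefold() in known) for s in split_sentences(text)]


def render(marked):
    """Wrap the unknown sentences in a visible marker for the author."""
    return " ".join(s if ok else "[?" + s + "?]" for s, ok in marked)


# Hypothetical set of segments already present in the Database.
known = {"The white table.", "The green table."}
draft = "The white table. The black table. The green table."

print(render(mark_unknown(draft, known)))
```

An author who rewrites the marked sentences until none remain would end up with a text that can be handled entirely by exact matches.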
927 15.9 Terminology Verification 928 The objective is to verify the Consistency and Harmonization of 929 the terminology. The concepts are closely related and they can be 930 combined, but they are not the same. 932 - Consistency is naming the same object with the same term. It is 933 an internal characteristic of a set of documents (the unitary set 934 is allowed) and it does not need a Database. The more linguistic 935 versions of the set of documents the better. 937 - Harmonization is the imposition of a term by the Terminological Authority. 938 It is an external characteristic of the document and it needs a 939 Database with the harmonized terms. 941 15.10 Multilingual Aligned Text Editor 942 An editor that shows at least two (aligned) texts, moves the texts 943 in sync, highlights the differences, etc. 945 15.11 Printing 946 A program that prints one or several Multilingual Aligned Texts side 947 by side. It could be the next step after the Translation 948 Folder. Multilingual Aligned Texts (source and target) on paper 949 allow the translator to use traditional tools such as dictating. 951 16. Acknowledgments 953 This document makes heavy use of the documents cited in the text, 954 particularly the relevant RFCs and Internet-Drafts. 956 It also draws on the following: 958 - Web Multilinguism. BOF meeting, Third International WWW Conference 960 - Web Internationalization. BOF meeting, Fourth International WWW 961 Conference 963 - Web Internationalization & Multilinguism. BOF meeting, Fifth 964 International WWW Conference 966 - Internationalization Workshop. Fifth International WWW Conference 968 - WInter mailing list 970 - Informal talks/communications (probably the most fruitful) 972 The BOF meetings were organized by the author. 974 Martin Duerst made many suggestions to the position paper of the 975 author for the Internationalization Workshop during the Fifth 976 International WWW Conference. The present document is over 80% 977 based on the position paper.
He commented on the Reference model and 978 I expect him to come back with further suggestions. 980 In such fluid circumstances, it is nearly impossible to attribute 981 credits, though the following particularly come to mind: 983 Bert Bos 984 Martin Bryan 985 Martin Duerst 986 Albert Lunde 987 Larry Masinter 988 Gavin Nicol 989 Steven Pemberton 990 Christine Stark 991 Francois Yergeau 992 Faith Zack 994 The author tries to look for consensus and borrowed heavily from 995 many sources. On the other hand, he alone is responsible for 996 any shortcomings and the opinions expressed. 998 17. Bibliography 1000 [BRIAN] Martin Bryan, "Using HyTime to Link Translations", contribution 1001 to the WInter mailing list, 1002 http://www.crpht.lu/~carrasco/winter/hytime.html 1004 [CARRASCO-1] M.T. Carrasco Benitez, "On the multilingual normalization 1005 of the Web", Poster for the Third International WWW Conference, 1006 http://www.crpht.lu/~carrasco/winter/poster.html 1008 [CARRASCO-2] M.T. Carrasco Benitez, "Web Internationalization", 1009 Poster for the Fourth International WWW Conference, 1010 http://www.crpht.lu/~carrasco/winter/inter.html 1012 [CARRASCO-3] M.T. Carrasco Benitez, "WInter (Web Internationalization 1013 & Multilinguism)", Position paper for the Internationalization 1014 Workshop during the Fifth International WWW Conference, 1015 http://www.crpht.lu/~carrasco/winter/popa.html 1017 [CONNOLLY] "Character Set Considered Harmful", 1018 http://www.w3.org/hypertext/WWW/MarkUp/html-spec/charset-harmful.html 1020 [HTML 2.0] T. Berners-Lee, D. Connolly, "HTML 2.0", RFC 1866, 1021 http://www.ics.uci.edu/pub/ietf/html/rfc1866.txt 1023 [HTML 3.0] "HTML 3.0", expired Internet-Draft, 1024 http://www.hpl.hp.co.uk/people/dsr/html3/CoverPage.html 1026 [HTTP-1.1] R.T. Fielding, H. Frystyk Nielsen, and T. Berners-Lee, 1027 "Hypertext Transfer Protocol -- HTTP/1.1", Work in progress 1028 (draft-ietf-http-v11-spec-01.txt) MIT/LCS, January 1996.
1029 http://www.ics.uci.edu/pub/ietf/http/draft-ietf-http-v11-spec-01.html 1031 [I-HTML] F. Yergeau, G. Nicol, G. Adams, M. Duerst, "Internationalization 1032 of the Hypertext Markup Language", Work in progress, 1033 (draft-ietf-html-i18n-03.txt) 1034 http://www.alis.com:8085/ietf/html/draft-ietf-html-i18n.txt 1036 [ISO-8859-1] ISO 8859-1:1987. International Standard -- Information 1037 Processing -- 8-bit Single-Byte Coded Graphic Character Sets -- 1038 Part 1: Latin Alphabet No. 1. 1040 [NICOL] G. T. Nicol, "The Multilingual WWW", 1041 http://www.ebt.com:8080/docs/multilingual-www.html 1043 [UNICODE] The Unicode Consortium, "The Unicode Standard -- Worldwide 1044 Character Encoding -- Version 1.0", Addison-Wesley, Volume 1, 1991, 1045 Volume 2, 1992. http://www.unicode.org 1047 [ZACK] F. Zack, "Serving Multilingual Online Documentation", Poster 1048 for the Fifth International WWW Conference 1050 {This list will be completed.} 1052 18. Author's Address 1054 Manuel Tomas CARRASCO BENITEZ 1055 carrasco@innet.lu 1056 http://www.crpht.lu/~carrasco/winter