idnits 2.17.1 

draft-benitez-winter-cultures-00.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Cannot find the required boilerplate sections (Copyright, IPR, etc.) in
     this document.

     Expected boilerplate is as follows today (2024-04-25) according to
     https://trustee.ietf.org/license-info :

     IETF Trust Legal Provisions of 28-dec-2009, Section 6.a:
        This Internet-Draft is submitted in full conformance with the provisions
        of BCP 78 and BCP 79.

     IETF Trust Legal Provisions of 28-dec-2009, Section 6.b(i), paragraph 2:
        Copyright (c) 2024 IETF Trust and the persons identified as the document
        authors.  All rights reserved.

     IETF Trust Legal Provisions of 28-dec-2009, Section 6.b(i), paragraph 3:
        This document is subject to BCP 78 and the IETF Trust's Legal Provisions
        Relating to IETF Documents
        (https://trustee.ietf.org/license-info) in effect on the date of
        publication of this document.  Please review these documents
        carefully, as they describe your rights and restrictions with
        respect to this document.  Code Components extracted from this
        document must include Simplified BSD License text as described in
        Section 4.e of the Trust Legal Provisions and are provided
        without warranty as described in the Simplified BSD License.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  ** Missing expiration date.  The document expiration date should appear on
     the first and last page.

  ** The document seems to lack a 1id_guidelines paragraph about
     Internet-Drafts being working documents. 

  ** The document seems to lack a 1id_guidelines paragraph about the list of
     current Internet-Drafts. 

  ** The document seems to lack a 1id_guidelines paragraph about the list of
     Shadow Directories. 

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack a Security Considerations section.

  ** The document seems to lack an IANA Considerations section.  (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  ** There are 3 instances of too long lines in the document, the longest one
     being 31 characters in excess of 72.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The "Author's Address" (or "Authors' Addresses") section title is
     misspelled.

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- Couldn't find a document date in the document -- date freshness check
     skipped.


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Missing reference section? 'UNICODE' on line 1043 looks like a reference

  -- Missing reference section? 'I-HTML' on line 1031 looks like a reference

  -- Missing reference section? 'BRIAN' on line 1000 looks like a reference

  -- Missing reference section? 'CARRASCO-1' on line 1004 looks like a
     reference

  -- Missing reference section? 'CARRASCO-2' on line 1008 looks like a
     reference

  -- Missing reference section? 'CARRASCO-3' on line 1012 looks like a
     reference

  -- Missing reference section? 'CONNOLLY' on line 1017 looks like a reference

  -- Missing reference section? 'ISO-8859-1' on line 1036 looks like a
     reference

  -- Missing reference section? 'NICOL' on line 1040 looks like a reference

  -- Missing reference section? 'ZACK' on line 1047 looks like a reference


     Summary: 8 errors (**), 0 flaws (~~), 2 warnings (==), 12 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

1	INTERNET-DRAFT                                     M.T. Carrasco Benitez
2	<draft-benitez-winter-cultures-00.txt>
3	Expires November 16th 1996                                May 16th, 1996

5	                                 WInter
6	               (Web Internationalization & Multilinguism)

8	Status of this Memo

10	This document is an Internet-Draft. Internet-Drafts are working
11	documents of the Internet Engineering Task Force (IETF), its areas,
12	and its working groups. Note that other groups may also distribute
13	working documents as Internet-Drafts.

15	Internet-Drafts are draft documents valid for a maximum of six
16	months and may be updated, replaced, or obsoleted by other documents
17	at any time. It is inappropriate to use Internet-Drafts as reference
18	material or to cite them other than as "work in progress".

20	To learn the current status of any Internet-Draft, please check
21	the "1id-abstracts.txt" listing contained in the Internet-Drafts
22	Shadow Directories on ftp.is.co.za (Africa), nic.nordu.net (Europe),
23	munnari.oz.au (Pacific Rim), ds.internic.net (US East Coast), or
24	ftp.isi.edu (US West Coast).

26	Distribution of this document is unlimited. Please send comments
27	to the WInter mailing list at <winter@dorado.crpht.lu>. Information
28	about the WInter mailing list, including subscription details, is
29	on the WInter Page at:  http://www.crpht.lu/~carrasco/winter

31	Abstract

33	This document discusses the Internationalization & Multilinguism
34	of the Web: a Web capable of supporting different cultures, natural
35	languages and Language Engineering facilities such as Parallel
36	Texts. Internationalization permeates most subsystems: client,
37	transmission, server, data and authoring; the primitive mechanism
38	for WIntering should be part of the Web foundations.

40	Table of Contents

42	1. Introduction
43	  1.1 Mandate
44	  1.2 Writing style
45	  1.3 Terminology

47	2. Character Set
48	  2.1 Back office
49	  2.2 Front office
50	  2.3 Multilingual typography
51	  2.4 The characters in the URL

53	3. Internationalization & localization
54	  3.1 Elements of localization
55	  3.2 Messages as HTML pages

57	4. Multilinguism

59	5. Parallel Hypertext
60	  5.1 Definition
61	  5.2 Language tags
62	  5.3 Document request
63	  5.4 Parallel Hypertext Data Structure (PHDS)
64	  5.5 Linking strategy
65	  5.6 Generation of parallel texts
66	    5.6.1 Language dependent strings
67	    5.6.2 Language-void document

69	6. Bidirectionallity (BIDI)
70	7. The LANG attribute
71	8. LINKs
72	9. Multilingual thesaurus
73	10. Electronic Data Interchange (EDI)
74	11. Passing selected text to a CGI
75	12. Reference model for Internationalization & Multilinguism
76	13. VRML
77	14. Java

79	15. Dragoman
80	  15.1 Interactive Search
81	  15.2 The Translation Folder (full preprocessing)
82	  15.3 Preprocessing for Machine Translation
83	  15.4 Machine Translation
84	  15.5 Pseudo-Automatic Translation (PAT)
85	  15.6 Document Generation
86	  15.7 Document Comparison
87	  15.8 Author's Workbench
88	  15.9 Terminology Verification
89	  15.10 Multilingual Aligned Text Editor
90	  15.11 Printing

92	16. Acknowledgments
93	17. Bibliography
94	18. Author Address

96	1. Introduction

98	The intention of this document is to consider all aspects for
99	WIntering. It aims to fulfill two functions:

101	-  A catalogue of issues

103	-  A primer

105	To a very large extent, it puts together the efforts of other
106	groups. It goes into more detail when materials are not covered
107	elsewhere.

109	An Internationalized & Multilingual Web should have the traditional
110	facilities of Internationalization and more advanced facilities
111	needed for Language Engineering. For example, clients should have
112	a language menu (similar to edit or file menus) that shows in which
113	other linguistic versions the currently displayed document is
114	available; or clients should be capable of displaying and moving
115	in sync side by side, two linguistic versions of the same document.

117	"Another noteworthy characteristic of this manual is that it doesn't
118	always tell the truth. When certain concepts of TEX are introduced
119	informally, general rules will be stated; afterwards you will find
120	that the rules aren't strictly true."

122	The TeXbook, Donald E. Knuth

124	The above quote particularly applies to the documents summarized in
125	this document. Though the intention is to make this document
126	self-contained by summarizing or quoting other documents, it is
127	strongly recommended to consult the source documents.

129	1.1 Mandate
130	One of the recommendations of the Internationalization Workshop
131	during the Fifth International WWW Conference in Paris on May 6th
132	1996, was that a document should be maintained to fulfill the
133	purpose described in the above introduction. The author accepted
134	the task and the present document is the result.

136	1.2 Writing style
137	A special effort should be made to make this document as accessible
138	as possible to non-computer specialists (e.g., linguists) and
139	non-native English speakers. Due to the characteristics of WInter,
140	there should be a significant number of both. This does not imply
141	that there should be one type of document for each type of participant.
142	It means that this document should be accessible to all participants,
143	perhaps by adopting a journalistic style and restating the obvious.
144	The overhead should be small and it is good to avoid misunderstanding,
145	even between people of the same field.

147	Comments regarding the writing style from journalists or readers
148	with similar profiles are very welcome; i.e., non-computer specialists
149	who have to explain computer materials to other non-computer
150	specialists. Some of the suggestions could be what additional
151	material should be included to make this document more self-contained;
152	and what terms should be replaced to make it more accessible. But,
153	the gory normative details must be present.

155	1.3 Terminology
156	Alignedness
157	It is a quality of Parallel Texts; for example, the Treaty of Rome
158	in English and Spanish are Parallel Texts and they should be aligned.
159	The interesting part is aligning Parallel Texts automatically.

161	Author-Translator-Publisher Chain (ATP-chain)
162	It refers to the integration of all the phases in the production
163	of documents. These are usually large distributed systems.

165	Globalization
166	In the context of electronic commerce, the mechanisms to facilitate
167	global trade. Internationalization & Multilinguism are some of
168	these mechanisms. A legal framework is an example of a non-computer
169	mechanism.

171	I18N
172	Abbreviation for Internationalization. The 18 refers to the eighteen
173	characters "nternationalizatio" between the initial I and the final N.

175	Language Engineering
176	Language Engineering is the application of computer science to
177	natural languages. For example:

179	-  Terminology

181	-  Translator's Memory

183	-  Multilingual documentary databases

185	-  Aligned Text
186	-  Translator's Workbench

188	-  Author's Workbench

190	-  Machine Translation

192	-  Publishing (in particular, multilingual synchronized publishing)

194	Level of Alignedness
195	This is a metric of alignedness. Depending on the depth to which it
196	is possible to identify the Linguistic Objects, the texts are aligned
197	at:

199	-  Document level: the trivial case; i.e., Parallel Texts.

201	-  Paragraph level: not too hard to achieve.

203	-  Sentence level: desirable and possible to achieve.

205	-  Term level: it needs tagging for automatic alignedness.

207	-  Word level: it needs tagging for automatic alignedness.

209	In this context, sentence is a part of a text delimited by a dot,
210	semicolon or similar; i.e., it has little grammatical meaning and
211	the main interest is to identify Linguistic Objects.
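
As a rough illustration of the definition above, the following Python
sketch splits a text into "sentences" at a dot or semicolon. The
function name and the exact delimiter set are assumptions made for
this example; the text only says "a dot, semicolon or similar".

```python
import re

# Assumed delimiter set: the text says "a dot, semicolon or similar".
_DELIMITERS = re.compile(r"[.;]")


def split_sentences(text):
    """Split text into delimiter-bounded segments, dropping empty ones."""
    return [part.strip() for part in _DELIMITERS.split(text) if part.strip()]


english = "The white table. The black table. The green table."
spanish = "La mesa blanca. La mesa negra. La mesa verde."

# Equal segment counts are a necessary (not sufficient) hint that two
# Parallel Texts might be alignable at sentence level.
same_count = len(split_sentences(english)) == len(split_sentences(spanish))
print(same_count)
```

A real aligner would need more than a count comparison; this only
shows how little grammatical meaning the "sentence" carries here.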

213	Linguistic Object
214	Linguistic Object is a unit of language representation. It can be
215	a fixed language representation (term, abbreviation, title, segment,
216	phrase, paragraph, etc) or meta-language representation (a grammatical
217	construction, etc). More generally, a Linguistic Object is a discrete
218	linguistic unit (usually a string) whose meaning is created by the
219	program treating it.

221	Multilingual Aligned Text (MAT)
222	A MAT is a record in a table with one Linguistic Object per language
223	field (English, Spanish, German, etc) that are the equivalents
224	(usually the translation) of each other. There are other fields
225	for classification and other purposes. MATs constitute independent
226	elements of a table; i.e., there is no ordering in the table. The
227	end result is a data structure similar to a multilingual dictionary.
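
A minimal sketch of a MAT table in Python may help to picture the
structure. The class name, the field names and the "classification"
field are hypothetical; the text only requires one Linguistic Object
per language field plus other fields, with no ordering in the table.

```python
from dataclasses import dataclass


@dataclass
class MAT:
    """One Multilingual Aligned Text record (field names are hypothetical)."""
    texts: dict               # language tag -> Linguistic Object (a string)
    classification: str = ""  # example of an "other" field


# A table is an unordered collection of independent MAT records.
table = [
    MAT({"en": "The white table", "es": "La mesa blanca"}, "furniture"),
    MAT({"en": "The black table", "es": "La mesa negra"}, "furniture"),
]


def lookup(records, source_lang, text, target_lang):
    """Return the equivalent of text in target_lang, or None if absent."""
    for record in records:
        if record.texts.get(source_lang) == text:
            return record.texts.get(target_lang)
    return None


print(lookup(table, "en", "The black table", "es"))
```

Used this way, the table behaves like the multilingual dictionary
mentioned above.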

229	Parallel Texts
230	Texts that are translations of each other. For example, the Treaty
231	of Rome in English and Spanish are Parallel Texts. Parallel Texts
232	could be aligned to several levels.

234	WInter
235	It stands for Web Internationalization & Multilinguism.

237	2. Character Set

239	A large character set is a basic prerequisite for having
240	Internationalization & Multilinguism. The bottom line is that the
241	Web must be capable of handling Unicode [UNICODE].

243	The character set should be considered a low level layer; i.e.,
244	like the pieces of wire in the seven-layer ISO Reference Model
245	(physical, datalink, network, etc). Other functionalities should
246	be in other layers. There is a tendency to overload this layer,
247	as opposed to defining new layers.

249	There are two aspects to the character set:

251	The Back office
252	It deals with storage on disk, transmission, representation in the
253	document, etc.

255	The Front office
256	It is concerned with rendering on the screen or printer.

258	2.1 Back office
259	Latin-1 [ISO-8859-1] is the default character set for the Web.
260	Latin-1 is only sufficient for Western European languages. Latin-1
261	is an 8-bit encoding. This permits a maximum of 256 characters.

263	Unicode (ISO 10646 BMP) is a large character set that includes most
264	of the world's languages. Unicode is a 16-bit encoding. This permits
265	over 65,000 characters. At present, over 25,000 positions are still
266	free. This form is also called UCS-2; i.e., Universal Character
267	Set 2-bytes. Unicode is the first plane of ISO 10646 (see below);
268	this plane is also called BMP (Basic Multilingual Plane) or Plane
269	Zero. The Internationalization of the Hypertext Markup Language
270	[I-HTML] proposes Unicode as the document character set.

272	ISO 10646 is a 32-bit encoding. It is divided into 32,768 planes,
273	each with a capacity of 65,536 characters. This permits about 2,147
274	million characters. This form is also called UCS-4, Universal
275	Character Set 4-bytes. Only the first plane (Unicode) is in use.

277	UTF-8 (Universal Character Set Transformation Format) is an addendum
278	to ISO 10646. It provides compatibility with ASCII: the ASCII
279	characters are represented by 1 byte (8 bits) and not 4 bytes (32
280	bits). In general, it is economical with the bytes used in the
281	encoding.
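
The byte counts discussed in this section can be checked with a short
Python sketch. Note the assumptions: Python names its codecs "utf-16"
and "utf-32", which for characters of the first plane give the same
sizes as UCS-2 and UCS-4; the helper name encoded_sizes is invented
for this example.

```python
def encoded_sizes(ch):
    """Return the byte length of ch in (UTF-8, 2-byte form, 4-byte form)."""
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-be")),  # same size as UCS-2 for BMP characters
        len(ch.encode("utf-32-be")),  # same size as UCS-4
    )


# An ASCII letter, a Latin-1 letter (e with acute) and the euro sign.
for ch in ["A", "\u00e9", "\u20ac"]:
    print(repr(ch), encoded_sizes(ch))
```

The ASCII letter takes 1 byte in UTF-8 against 2 and 4 bytes in the
fixed-width forms, which is the economy referred to above.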

283	[HTTP-1.1] allows for the character set to be negotiated. For
284	example, the client and server can agree on using Unicode.

286	2.2 Front office
287	Rendering is drawing the glyphs (graphic representation of the
288	characters) on the screen or printer. This is the job of the browser
289	and the browser depends on the graphical facilities of the computer.

291	Undisplayable characters are the characters that cannot be displayed
292	due to the lack of facilities. The I-HTML "does not prescribe any
293	specific behavior", but notes some "considerations". WInter recommends
294	the following:

296	-  The behavior of undisplayable characters must be controlled by
297	the options setting of the browser.

299	-  Some options can be combined.

301	-  There must be a small Undisplayable Characters Flag in the
302	browser part of the screen, not in the document part. Something
303	similar to the red button indicating that the browser is loading
304	a document, but smaller. The flag must be ON if the current document
305	contains one or more undisplayable characters. The presence or
306	absence of the flag must be user definable.

308	-  Undisplayable Character Tolerance is a user definable value in
309	the range from 0 to 10, that signals the behavior of the browser.

311	-  0 Undisplayable Character Tolerance means ignore all undisplayable
312	characters.

314	-  5 Undisplayable Character Tolerance means a reasonable default
315	warning for undisplayable characters. This behavior must be defined.
316	For example, show only up to 10 continuous undisplayable characters
317	and try remaps, such as "e'" to "e".

319	-  10 Undisplayable Character Tolerance means show one Replacement
320	Glyph for each undisplayable character.

322	-  The other intermediary values must change gradually.

324	-  Undefined Undisplayable Character Tolerance must gravitate
325	towards the default value (5).

327	-  The undisplayable characters must be remappable to a user definable
328	Replacement Glyph, for example "_", or to one of several numeric
329	representations; for example, hexadecimal or decimal.

331	-  The default Replacement Glyph must occupy approximately the same
332	space as the average glyph in the document. It must be a box
333	containing the Unicode value in hex.
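
The recommendations above could be sketched as follows in Python.
This is only an illustration under stated assumptions: the function
names are invented, the remap is approximated by stripping accents
through Unicode decomposition, and every tolerance other than 0 or 10
is treated as the default (5), because the gradual intermediate
behavior is not defined in this document.

```python
import unicodedata


def remap(ch):
    """Try to reduce an accented character to its base letter."""
    return unicodedata.normalize("NFD", ch)[0]


def render(text, can_display, tolerance=5, replacement="_", max_run=10):
    """Return the text as a browser with the given tolerance might show it."""
    out = []
    run = 0  # length of the current run of Replacement Glyphs
    for ch in text:
        if can_display(ch):
            out.append(ch)
            run = 0
        elif tolerance == 0:
            continue  # ignore all undisplayable characters
        elif tolerance == 10:
            out.append(replacement)  # one glyph per undisplayable character
        else:
            # Assumed default behavior (5): try a remap first, then show
            # at most max_run continuous Replacement Glyphs.
            base = remap(ch)
            if base != ch and can_display(base):
                out.append(base)
                run = 0
            else:
                run += 1
                if run <= max_run:
                    out.append(replacement)
    return "".join(out)


def flag_is_on(text, can_display):
    """Undisplayable Characters Flag: ON if any character cannot be shown."""
    return any(not can_display(ch) for ch in text)


def ascii_only(ch):
    """A hypothetical display that can only draw ASCII glyphs."""
    return ord(ch) < 128


sample = "caf\u00e9 \u4e2d"
print(render(sample, ascii_only, tolerance=5))
print(flag_is_on(sample, ascii_only))
```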

335	Font Servers could supply the browser with missing glyphs.

337	2.3 Multilingual typography
338	{The proposal of Martin Duerst will be summarized here.}

340	2.4 The characters in the URL
341	The characters allowed in the URL are a subset of ASCII. URLs were
342	supposed to be hidden, but they are very visible and important
343	commercially: firms want to spell their names with accents. The
344	most urgent need is a large character set for the query part.
345	There have been proposals to use UTF-8. URLs need a lot of
346	work.

348	3. Internationalization & localization

350	Internationalized software is developed without the cultural
351	characteristics embedded. It can be localized parametrically for
352	different cultures; for example, the same software can run for
353	Germany with the German conventions, or for Italy with the Italian
354	conventions.

356	Internationalization is a well known field; for example, a significant
357	amount of effort was made during the POSIX (Unix) standardization.
358	The mechanisms must be sufficient for implementing the localizations.
359	Localization itself is usually discussed in other fora; for example,
360	how to represent the date in Germany. Most conventions have already
361	been agreed.

363	Any number of cultures (real or imaginary) are possible. For example,
364	France, Germany, European Commission. In the case of the European
365	Commission, it has to work in the eleven official languages (including
366	Greek), and with cross-cultural conventions or with the national
367	conventions.

369	3.1 Elements of localization
370	Languages
371	Two aspects:

373	-  Language strings in the software.

375	-  Data in the document.

377	Example, the software could be in German and the document shown in
378	French.

380	Sorting order
381	Number representation
382	Example, the internal number could be 12345.67 and the external
383	representation could be 12,345.67 or 12.345,67.

385	Date & Time
386	Example, the internal representation could be 19951231 and the
387	external representation could be December 31st 1995, or 31-12-1995.

389	Short quotations
390	Example,

392	-  "I am a Berliner" (English)

394	-  <<Je suis un Berlinois>> (French)

396	-  ,,Ich bin ein Berliner'' (German)

398	The new element <Q> in I-HTML is for this purpose.

400	New internationalization elements should be added to this list,
401	for example, color.

403	The software should be localized from a list of preferred localizations,
404	and switchable from one localization to another without re-starting
405	the application.
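
The number and date examples of this section can be reproduced with a
small Python sketch. The function names and the idea of passing the
separators and the date pattern as parameters are assumptions made
for illustration; a real product would take them from its
localization data.

```python
from datetime import datetime


def format_number(value, thousands, decimal):
    """Format value with two decimals and the given separators."""
    text = f"{value:,.2f}"  # internal form with "," and "." separators
    # Swap via a placeholder so the two separators do not collide.
    text = text.replace(",", "\0").replace(".", decimal)
    return text.replace("\0", thousands)


def format_date(internal, pattern):
    """Convert an internal YYYYMMDD string using a strftime pattern."""
    return datetime.strptime(internal, "%Y%m%d").strftime(pattern)


print(format_number(12345.67, ",", "."))    # 12,345.67
print(format_number(12345.67, ".", ","))    # 12.345,67
print(format_date("19951231", "%d-%m-%Y"))  # 31-12-1995
```

Because the conventions are parameters, switching from one
localization to another does not require re-starting anything.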

407	3.2 Messages as HTML pages
408	The Status-Code and the Reason-Phrase (see 6.1.1, HTTP-1.1) are
409	presented as HTML pages. These are Language strings in the software
410	but are usually presented as data documents. For example, 404: Not
411	Found.

413	The localization of the Reason-Phrase can be done by the client or
414	the server. If the client can do a better job, it has to drop the
415	page sent by the server and generate the localized page from the
416	Status-Code and the LANG tag.
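
A client-side localization of this kind could look like the following
Python sketch. The message catalogue, its translations, the function
name and the fallback to English are all assumptions made for the
example; only the Status-Code 404 and its Reason-Phrase "Not Found"
come from the text above.

```python
from html import escape

# Hypothetical message catalogue: (Status-Code, language) -> Reason-Phrase.
REASONS = {
    (404, "en"): "Not Found",
    (404, "es"): "No encontrado",
    (404, "de"): "Nicht gefunden",
}


def localized_page(status, lang, fallback="en"):
    """Build a small HTML page for status in lang, or in the fallback."""
    if (status, lang) not in REASONS:
        lang = fallback
    reason = REASONS.get((status, lang), "")
    title = escape(f"{status}: {reason}")
    return (
        f'<html lang="{escape(lang)}"><head><title>{title}</title></head>'
        f"<body><h1>{title}</h1></body></html>"
    )


print(localized_page(404, "es"))
```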

418	4. Multilinguism

420	Multilinguism deals with advanced language facilities, often several
421	languages simultaneously. It is also referred to as Language
422	Engineering. This comes from the tradition of specialized software
423	for Language Engineering, such as Translator's Workbench. One of the
424	main applications is the processing of Parallel Texts.

426	Most software in Language Engineering is incompatible and
427	there are practically no standards in this field. Usually, researchers
428	or vendors start from scratch and develop all the modules; even
429	horizontal modules such as user interfaces and data structures,
430	rather than concentrate on the engines for language processing (for
431	aiding the translator, machine translation, etc).

433	One of the main immediate objectives in Language Engineering must
434	be the creation of standards that clearly separate data and software;
435	i.e., it should be possible to acquire a translation aid program
436	from one vendor and the dictionaries from another vendor.

438	The purpose is not making every browser a Translator's Workbench,
439	though browsers could do with more advanced language facilities
440	than are usually found in internationalized products. But the
441	standards must allow the construction of Translator's Workbenches
442	based on the Web technology.

444	After security and the application for secure payment over the
445	Internet, Language Engineering is one of the applications most
446	relevant from an economic point of view; in intranets, with fewer
447	security requirements, it is probably the most important. It is as
448	horizontal as publishing and, indeed, it is the second phase in
449	the ATP-chain (Author-Translator-Publisher). Translating is expensive
450	and very labor intensive. For most texts, machine translation is
451	not acceptable. On the other hand, translation aid tools are
452	very cost effective, particularly if integrated in an ATP-chain.
453	Savings in translating tend to be big.

455	5. Parallel Hypertext

457	5.1 Definition
458	Parallel Hypertext is an extension of the hypertext paradigm to
459	natural languages. For example, a user looking at a document in
460	English should be able to obtain the Spanish version in a transparent
461	way; i.e., just by selecting the Spanish option in a language menu
462	and not by selecting a link embedded in the English version. For
463	this, the Web must know about languages; i.e., about the same
464	document in another language. The same property of alignedness in
465	Parallel Texts applies to Parallel Hypertext.

467	5.2 Language tags
468	The language tags (see 3.10, HTTP-1.1) are composed of a primary
469	language tag and a possibly empty series of subtags.

471	Examples:

473	en
474	en-US
475	en-cockney
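
Splitting such tags into their parts is simple; a minimal Python
sketch follows. The function name is an assumption, and lowercasing
the primary tag is a choice made here for easier comparison, not
something this document prescribes.

```python
def parse_language_tag(tag):
    """Split a language tag into (primary tag, list of subtags)."""
    primary, *subtags = tag.split("-")
    return primary.lower(), subtags


for tag in ["en", "en-US", "en-cockney"]:
    print(tag, parse_language_tag(tag))
```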

477	There must be a way to indicate

479	-  Human translation

481	-  Machine translation

483	-  Transliteration

485	This could be part of a subtag or inside the document.
486	{Examples will be added.}

488	5.3 Document request
489	Clients should be able to request documents at least in the following
490	ways:

492	-  A document is requested according to a language preference list
493	that could be the same list used for choosing the display labels
494	in the user interface. The server must respond with the best linguistic
495	version and the list of available linguistic versions. The best
496	linguistic version means the one nearest to the top of the list and,
497	if none is available, the one nearest to the top of the defaults in
498	the server. In this case, the browser probably does not know which
499	linguistic versions are available.
500	{This will be developed.}

502	-  A document is requested in one specific language. The server
503	must respond only with that linguistic version (no other is
504	acceptable) and the list of available linguistic versions. In this
505	case, the client probably knows that the requested version is
506	available; it could be the result of a previous conversation with
507	the server.

509	Example:

511	-  Conversation 1
512	Client : Give me MyDoc with this order of preference: Danish,
513	English or German
514	Server : Take MyDoc in German; it is available in German, Italian and Spanish

516	-  Conversation 2
517	Client : Give me MyDoc only in Spanish
518	Server : Take MyDoc in Spanish; it is available in German, Italian
519	and Spanish
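
The two conversations above can be sketched as a selection function
in Python. The function name, the parameter names and the server
default list used in the example are assumptions; the rule itself
(nearest to the top of the client list, then nearest to the top of
the server defaults) is the one stated in this section.

```python
def choose_version(available, preferred=(), server_defaults=(), only=None):
    """Return (chosen language or None, list of available languages)."""
    if only is not None:
        # A specific language was requested: no other is acceptable.
        chosen = only if only in available else None
        return chosen, list(available)
    # Client preferences first, then the defaults of the server.
    for lang in list(preferred) + list(server_defaults):
        if lang in available:
            return lang, list(available)
    return None, list(available)


available = ["de", "it", "es"]  # German, Italian and Spanish

# Conversation 1: Danish, English or German, in that order.
print(choose_version(available, preferred=["da", "en", "de"],
                     server_defaults=["it"]))

# Conversation 2: only Spanish.
print(choose_version(available, only="es"))
```

In both cases the list of available linguistic versions is returned
with the answer, so the client could update its language menu.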

521	The linguistic versions of the document could be in different servers.

523	This could be done with the Accept-Language and Content-Language
524	facilities (see 10.4 and 10.11, HTTP-1.1).

526	The parameter in Accept-Language:

528	Quality factor "q" is described as "... estimate of the user's
529	comprehension of that language ...". But the user indicates his
530	language preference list and there is no need to use the parameter
531	with this meaning. It would be more useful to indicate the "minimum
532	acceptable quality of the translation". Some translations
533	could be done by more or less experienced translators, or by machine
534	translation.

536	A different usage could be to indicate the level of alignedness.

538	Maximum acceptable size "mxb" is not used. It could indicate the
539	number of linguistic versions desired.

541	An Accept-Language with a single language parameter must mean that
542	the browser only wants that linguistic version and not another.

544	The Content-Language "... describes the natural language(s) of the
545	intended audience ...". The meaning of this field should be "the
546	list of linguistic versions available"; it should be used by the
547	browser to update the language menu, so the user could know which
548	other linguistic versions are available.

550	5.4 Parallel Hypertext Data Structure (PHDS)
551	One Parallel Hypertext Data Structure contains all the information
552	for one Parallel Hypertext Document. The Parallel Hypertext Data
553	Structure must allow the following:

555	-  Several data schemes. For example, directory, SGML, tar, etc

557	-  Keeping the linguistic versions in different servers

559	-  Conversation with monolingual clients. In this case, the user
560	must know the structure

562	The Parallel Hypertext Data Structure has two parts:

564	The PHDS-Header
565	Contains administrative data. For example, where is the German
566	linguistic version. The data is divided into structured fields.

568	The PHDS-Body
569	Contains the linguistic data. It has one section per language.

571	The PHDS-Header is always an HTML file. This file must fulfill two
572	functions:

574	-  Allowing a user to select one linguistic version

576	-  Be used by WIntered Web programs (clients/servers) as a
577	datastructure to locate the pertinent linguistic version

579	The PHDS-Header must contain at least the following information:

581	-  Name

583	-  DataScheme

585	-  DataLocation (for all the parts)

587	The DataScheme applies only to the PHDS-Body. The PHDS-Header is
588	always an HTML file.

590	{An example of a file in HTML will be added.}

592	The default for a single set of files is:

594	DocName.html                              (PHDS-Header)

596	DocNameDir                                (PHDS-Body, a directory)
597	           /en.html             English   (PHDS-Body language section)
598	           /es.html             Spanish   (PHDS-Body language section)
599	           /de.html             German    (PHDS-Body language section)

601	The default for several sets of files is:

603	DocName.html                              (PHDS-Header)

605	DocNameDir                                (PHDS-Body, a directory)
606	           /en/DocName1.html    English   (PHDS-Body language section)
607	           /en/DocName2.html    English   (PHDS-Body language section)

609	           /es/DocName1.html    Spanish   (PHDS-Body language section)
610	           /es/DocName2.html    Spanish   (PHDS-Body language section)

612	           /de/DocName1.html    German    (PHDS-Body language section)
613	           /de/DocName2.html    German    (PHDS-Body language section)

615	The DocName.html should be usable directly by the present clients
616	(browsers) and/or indirectly to generate HTML files on the fly.
617	Multilingual clients should use the information to access the
618	documents in a transparent way.

620	Requesting a URL of a PHDS-Header must get the linguistic version
621	according to the rules of the language preferences. Requesting a
622	URL of a PHDS-Body language section must get that linguistic version.
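
Under the default layouts shown above, locating a file is a matter of
building a path. The following Python sketch assumes that the
directory name is the document name followed by "Dir", as in the
listings; the function names are invented for this example.

```python
def header_path(doc_name):
    """Path of the PHDS-Header for doc_name."""
    return f"{doc_name}.html"


def body_path(doc_name, lang, part=None):
    """Path of a PHDS-Body language section.

    With part=None the single-set layout is used; otherwise part names
    one file of the several-sets layout.
    """
    if part is None:
        return f"{doc_name}Dir/{lang}.html"
    return f"{doc_name}Dir/{lang}/{part}.html"


print(header_path("DocName"))
print(body_path("DocName", "es"))
print(body_path("DocName", "de", "DocName2"))
```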

624	The server must know at least the following defaults:

626	-  language with the explicit links

628	-  preferred language list

630	-  MAT table

632	{This will be extended.}

634	A standard data structure for Parallel Hypertext would be of use
635	for anybody working with Parallel Texts, regardless of whether the
636	Web is used. For example, CD-ROMs could be published with Parallel
637	Texts for language processing programs, such as Machine Translation,
638	that would know what to expect. At present, there is no standard
639	for Parallel Texts or MAT.

641	The relation with Text Encoding Initiative (TEI) will be explored.

643	5.5 Linking strategy
644	The linking strategy must minimize the maintenance. This is essential
645	for large multilingual documentary databases. For example, the
646	millions of pages of the European Institutions in eleven languages.
647	Only one linguistic version should have explicit links; i.e., the
648	links as used today that are physically present in the documents.
649	The other linguistic versions would have implicit links; i.e. links
650	that would not be physically present in the texts, but they could
651	be calculated by the alignedness of the different linguistic
652	versions.

654	The generation of implicit links could be a client, server and/or
655	authoring affair:

657	-  Client.- A client could receive a linguistic version with explicit
658	links and a linguistic version with implicit links. The client
659	would display the linguistic version with the explicit links or it
660	would calculate the implicit links on the fly and display the
661	result.

663	-  Server.- A multilingual server could process documents with
664	implicit links and generate documents with explicit links on the fly.

666	-  Authoring.- An interactive or batch authoring system could
667	process documents with implicit links and it could create new
668	documents with explicit links; the server would not know how the
669	new documents were created.

671	These options should be considered as a continuum and (some) are
672	not mutually exclusive: most degrees between the extremes are
673	possible. For example, servers could be able to create documents
674	on the fly and they could be using documents with the links generated
675	by authoring systems. Indeed, a mixture could be the most probable
676	case.

678	The level of alignedness should be calculated in advance and kept
679	in the Parallel Hypertext Data Structure. Some documents were widely
680	regarded as aligned because they had been revised over half a dozen
681	times and had been heavily used for decades (best-case
682	documents); once submitted to a computer program, it came to light
683	that they were not aligned even at paragraph level.

685	The linked text (i.e., what goes between <a ...> and </a>) cannot
686	be finer than the level at which the texts are aligned.
687	For example, for texts aligned only at paragraph level, it is not
688	possible to calculate implicit links at sentence level. A corollary
689	is that texts aligned at document level can have implicit links
690	only at the beginning or at the end.

692	The links would have to be at least at sentence level. It would be
693	hard to place implicit links in part of a sentence without tagging:
694	the second text should have null links; named null links if there
695	are several in one sentence.

697	Examples:

699	-  No need for null links in the second text. A whole sentence is
700	linked in the first text and finding the place for the implicit
701	links in the second text is easy.

703	The white table. <a href="MyURL"> The black table </a> The green table.
704	La mesa blanca.                   La mesa negra.       La mesa verde.
705	                 (implicit link)
706	-  It needs a null link in the second text. Only part of a sentence
707	is linked in the first text and finding the place for the implicit
708	link in the second text is hard; i.e., it cannot be done with simple
709	string processing and it needs computational linguistics.

711	The white table. The black <a href="MyURL"> table </a> The green table.
712	La mesa blanca.  La <a name="Null"> mesa </a> negra.   La mesa verde.
713	                     (null link)
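
The first example above (a whole sentence linked in the first text)
can be sketched in Python as follows. This is only a minimal sketch
under assumptions not made in this document: the texts are already
split into sentences aligned one to one, and the function name
implicit_links and the anchor pattern are hypothetical.

```python
import re

# Hypothetical pattern: an anchor that wraps exactly one whole sentence.
ANCHOR = re.compile(r'^<a href="([^"]*)">\s*(.*?)\s*</a>$')


def implicit_links(explicit, plain):
    """Copy sentence-level links from `explicit` onto the aligned `plain` text.

    Both arguments are lists of sentences; sentence i of one list is
    assumed to be the translation of sentence i of the other.
    """
    if len(explicit) != len(plain):
        raise ValueError("texts are not aligned at sentence level")
    result = []
    for source, target in zip(explicit, plain):
        match = ANCHOR.match(source.strip())
        if match:
            # The whole target sentence inherits the link of the source.
            result.append('<a href="%s">%s</a>' % (match.group(1), target))
        else:
            result.append(target)
    return result


english = [
    "The white table.",
    '<a href="MyURL"> The black table. </a>',
    "The green table.",
]
spanish = ["La mesa blanca.", "La mesa negra.", "La mesa verde."]
linked_spanish = implicit_links(english, spanish)
```

The second example (a link on part of a sentence) is deliberately
not covered: as noted above, it needs a null link in the second
text or computational linguistics.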

715	5.6 Generation of parallel texts
716	The linguistic versions could be generated through machine translation
717	or other techniques. For example, a system could have documents in
718	Spanish and a program for translation to English. The language
719	menu should tell the user in which languages, and through
720	which techniques (MT, human translator, etc.), the documents are
721	available.

723	{This will be extended.}

725	5.6.1 Language dependent strings
726	These are tags to be replaced by a language string (Linguistic Object)
727	according to the language requested. For example, the following
728	shows the content of an HTML document and the resulting replacement,
729	assuming that the language requested is German and that the Linguistic
730	Object corresponding to the identifier String_1 is the German phrase
731	below:

733	 <SomeTag SomeLabel=String_1>

735	 Ich bin ein Berliner
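
The replacement above could be sketched in Python as follows. This
is a minimal sketch, not a definitive implementation: the in-memory
dictionary standing in for the database of Linguistic Objects, the
function name localize and the English string are assumptions made
only for illustration.

```python
import re

# Hypothetical database of Linguistic Objects: identifier -> {language: string}.
LINGUISTIC_OBJECTS = {
    "String_1": {"de": "Ich bin ein Berliner", "en": "I am a Berliner"},
}

# Matches the placeholder tag form used in the example above.
TAG = re.compile(r"<SomeTag SomeLabel=(\w+)>")


def localize(document, language):
    """Replace every language dependent string by its Linguistic Object."""
    def replace(match):
        return LINGUISTIC_OBJECTS[match.group(1)][language]
    return TAG.sub(replace, document)


german = localize("<SomeTag SomeLabel=String_1>", "de")
```

An identifier or language missing from the database raises KeyError
here; a real system would need a policy for that case (for example,
falling back to another language).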

737	5.6.2 Language-void document
738	A document without any language string; i.e., it contains only
739	language dependent strings. In this case, only one HTML document
740	is needed and not one per language; this HTML document could be
741	considered a mask. A database with Linguistic Objects is needed.
742	The same Linguistic Object can be used in several documents.

744	This technique could be used for the localization of the messages sent by the server as HTML documents.

746	6. Bidirectionality (BIDI)

748	(see 4.2, I-HTML)
749	{A summary of the I-HTML will be inserted.}
750	7. The LANG attribute

752	(see 3, I-HTML)
753	{A summary of the I-HTML will be inserted.}

755	8. LINKs

757	<LINK REL=Glossary>
758	<LINK REL=Dictionary>
759	<LINK REL=Translation>
760	{This will be extended.}

762	9. Multilingual thesaurus

764	This is a tool for finding matches for a search in any language.
765	For example, if the string in the search is "table" it should also
766	find the Spanish document with the word "mesa" (table in Spanish).
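
A minimal sketch of such a lookup in Python is given below. The
thesaurus content, the sample documents and the function names
expand and search are assumptions made only for illustration; a
real thesaurus would also have to deal with inflection and
ambiguity.

```python
# Hypothetical thesaurus: each entry groups the equivalent terms of one concept.
THESAURUS = [
    {"en": "table", "es": "mesa"},
]


def expand(term):
    """Return the search term together with its equivalents in other languages."""
    terms = {term.lower()}
    for concept in THESAURUS:
        if term.lower() in concept.values():
            terms.update(concept.values())
    return terms


def search(term, documents):
    """Return the names of the documents containing any equivalent of `term`."""
    wanted = expand(term)
    return sorted(
        name
        for name, text in documents.items()
        if wanted & set(text.lower().replace(".", " ").split())
    )


# Hypothetical documents, one per language.
documents = {
    "doc-en": "The white table.",
    "doc-es": "La mesa blanca.",
    "doc-fr": "Le chat noir.",
}
hits = search("table", documents)
```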

768	10. Electronic Data Interchange (EDI)

770	Many EDI messages are printed. As the EDI messages are very
771	structured, a translation of the message could be shown using
772	Pseudo-Automatic Translation (PAT).

774	11. Passing selected text to a CGI

776	To consult terminological databases easily, it should be possible
777	to pass a selected string (with the mouse or otherwise) to CGI
778	programs or similar. This is a generic mechanism.

780	12. Reference model for Internationalization & Multilinguism

782	This is a very first attempt and further work is needed. The model
783	is layered, similar to the seven-layer ISO Reference Model (physical,
784	datalink, network, etc). A different approach could be needed; for
785	example, a vector approach.

787	LayerNumber   LayerName         Example

789	1             compression       gzip
790	2             transformation    UTF-8
791	3             character set     Unicode (65, "LATIN CAPITAL LETTER A")
792	4             glyph             "A"
793	5             font              Times
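
Layers 1 to 3 of the table can be walked through with the Python
standard library, as a minimal sketch. The variable names are
illustrative only; the glyph and font layers belong to rendering
and are not shown.

```python
import gzip
import unicodedata

# Layer 3, character set: code position 65 and its Unicode name.
code_point = 65
character = chr(code_point)
name = unicodedata.name(character)

# Layer 2, transformation: the character encoded as UTF-8 bytes.
encoded = character.encode("utf-8")

# Layer 1, compression: the encoded bytes compressed with gzip.
compressed = gzip.compress(encoded)

# Going back up the layers restores the original character.
restored = gzip.decompress(compressed).decode("utf-8")
```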

795	Other items to put into the model:

797	-  sorting order

799	-  language (e.g., Korean)

801	There is a general tendency to overload the character set layer.
802	For example, wishing to allocate two code positions to the same
803	ideogram because it means different things in different languages.

805	13. VRML

807	How do objects negotiate when they speak different languages?
808	{This will be developed.}

810	14. Java

812	{This will be developed.}

814	15. Dragoman

816	This section is included mostly to illustrate the kind of applications
817	for multilinguism.

819	Dragoman is a reference model for Language Engineering. It uses the
820	Multilingual Aligned Hypertext technique. In essence, Dragoman
821	describes a Database (part structured and part documentary) and
822	Services that can be implemented over the (multilingual) Database.
823	Often, different data structures are used for the Services described
824	below.

826	The Web paradigm is particularly well adapted to Dragoman. The term
827	Dragoman has nothing to do with dragons; it means language interpreter.

829	What follows is a very brief description of some of the Services
830	that could be implemented over the Database. There could be several
831	programs offering the same Service. Services processing whole
832	documents could be implemented in batch; particularly if they are
833	using a very large Database (several gigabytes).

835	15.1 Interactive Search
836	Selects the Multilingual Aligned Texts (MAT) that match the search
837	criteria. The search is fuzzy (e.g. 87% match). Unfound requests
838	are valuable information that must be processed further. The system
839	must keep track of the unfound requests to put people with similar
840	needs in contact (matchmaker); the user must decide what is a
841	typing error and what is a genuine unfound request. Also the user
842	can send messages to terminologists (demand driven terminology).
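
A fuzzy search with a log of unfound requests could be sketched in
Python as follows. This is a sketch under stated assumptions: the
similarity measure is the ratio of difflib.SequenceMatcher from the
Python standard library, which this document does not prescribe,
and the record layout and the function names are hypothetical.

```python
from difflib import SequenceMatcher

# Requests for which nothing was found; kept for further processing.
unfound = []


def fuzzy_search(query, records, language="en", threshold=0.87):
    """Return (score, record) pairs whose text matches `query` closely enough."""
    results = []
    for record in records:
        score = SequenceMatcher(
            None, query.lower(), record[language].lower()
        ).ratio()
        if score >= threshold:
            results.append((score, record))
    # Best match first.
    results.sort(key=lambda item: item[0], reverse=True)
    return results


def lookup(query, records):
    """Search and keep track of the request when nothing is found."""
    results = fuzzy_search(query, records)
    if not results:
        unfound.append(query)
    return results


# Hypothetical Multilingual Aligned Texts.
mat = [
    {"en": "The white table.", "es": "La mesa blanca."},
    {"en": "The black table.", "es": "La mesa negra."},
]
found = lookup("The white table", mat)
missing = lookup("Something else entirely", mat)
```

Deciding whether an entry in the unfound log is a typing error or
a genuine unfound request is left to the user, as described above.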

844	15.2 The Translation Folder (full preprocessing)
845	The objective is to obtain a complete Translation Folder for a
846	given document. Hence, the translator should not need to consult
847	dictionaries, databases, glossaries, nomenclature lists, etc. It is
848	like having a hundred assistants preparing the text for the
849	translator. In a typical Translation Folder, some paragraphs should
850	be fully translated and some paragraphs should be a mixture of full
851	sentences, segments, titles, terms, nomenclatures, etc (all these
852	items are packaged as Linguistic Objects); background documents
853	could also be taken into account. The Linguistic Objects are marked
854	with the Status; for example, unverified, verified, compulsory,
855	etc. The search follows a fuzzy biggest chunk heuristic. Traditionally
856	there are two texts, source and target. But there could be any
857	number of language fields. This could be the most useful Service
858	for the translator and it should be implemented early. The translator
859	could use the result on paper or on the screen.

861	15.3 Preprocessing for Machine Translation
862	Similar to the Translation Folder. It should be adapted to an
863	(existing) machine translation program that follows up the processing.
864	For example, select only exact matches (no fuzzy) and terms in the
865	unfound phrases; the machine translation program would translate
866	only the unfound phrases.

868	15.4 Machine Translation
869	A Machine Translation program that uses the Database directly. For
870	example, a program could combine perfect matches, process the easy
871	fuzzy matches such as dates, pure Machine Translation, etc.

873	15.5 Pseudo-Automatic Translation (PAT)
874	Similar to the Translation Folder, but where all the texts are
875	found with a 100% match (no fuzzy search). The program should be
876	restricted to a collection of records; i.e., it should not be
877	allowed to roam the Database as there could be bad surprises. In
878	particular, one must avoid word by word translation; hence one must
879	be very careful with small Multilingual Aligned Texts (for example,
880	a one-word Multilingual Aligned Text).
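
The restriction to exact matches within a fixed collection could be
sketched in Python as follows. The function name, the collection
layout and the sample records are assumptions made only for
illustration.

```python
def pseudo_automatic_translation(segments, collection, target):
    """Translate `segments` using exact (100%) matches only.

    `collection` is the restricted set of records the program may use:
    a mapping from a source segment to its versions per language.
    """
    translated = []
    for segment in segments:
        record = collection.get(segment)
        if record is None:
            # No fuzzy fallback and no roaming outside the collection.
            raise LookupError("no 100%% match for %r" % segment)
        translated.append(record[target])
    return translated


# Hypothetical collection of records for one kind of structured message.
collection = {
    "The white table.": {"es": "La mesa blanca."},
    "The green table.": {"es": "La mesa verde."},
}
message = ["The white table.", "The green table."]
spanish = pseudo_automatic_translation(message, collection, "es")
```

Failing on the first segment without an exact match, rather than
guessing, is one way to avoid the bad surprises mentioned above.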

882	15.6 Document Generation
883	All the linguistic versions of a document are generated
884	camera-ready. There is no source and translation as such, the index is
885	created, the typesetting (nearly) done. This is the most useful
886	Service for the Organization. It is a very efficient way to produce
887	documents. The three phases Author-Translator-Publisher (ATP-chain)
888	are highly integrated. It is particularly adapted to periodic
889	publications. The production of standardized documents is trivial.

891	Documents in several linguistic versions are often required to be
892	synchronized; i.e., each page in each linguistic version must
893	contain the same content and the same lay-out (text, number of
894	paragraphs, etc). The typesetting, including the synchronization,
895	must be automated and each page should not be processed by a human;
896	a human operator should intervene only to fine-tune the publication.
897	TeX should be considered.

899	A document might need several representations; for example, typeset
900	for the Official Journal, formatted for a CD-ROM or marked in HTML
901	(for CD-ROM or server). First, a document in SGML should be generated;
902	indeed, the SGML document is the document. All the following
903	representations should be created from the SGML document. This
904	method should guarantee that all the representations have the same
905	content.

907	With such a system in place, the creation of secondary products is
908	easy. For example, a Parliamentary Commission could work with a
909	draft of the Budget typeset like the Official Journal, in all
910	the linguistic versions, enriched with hidden comments.

912	15.7 Document Comparison
913	The user directs the program to a document similar to the one that
914	has to be translated. The new pieces could be fetched in the
915	Database. This program could work without the Database, though the
916	new pieces would not be fetched. Similar translations could arise
917	as a version of a previous document and as a new similar document.

919	15.8 Author's Workbench
920	Authors could use a technique similar to the Translation Folder and
921	Document Comparison. The unknown parts of the text would be marked
922	and in certain cases alternatives would be proposed. Texts created
923	with the translation phase in mind are easier to translate. Ideally,
924	the author should aim to produce a text for translation with
925	Pseudo-Automatic Translation.

927	15.9 Terminology Verification
928	The objective is to verify the Consistency and Harmonization of
929	the terminology. The concepts are closely related and they can be
930	combined, but they are not the same.

932	-  Consistency is naming the same object with the same term. It is
933	an internal characteristic of a set of documents (the unitary set
934	is allowed) and it does not need a Database. The more linguistic
935	versions of the set of documents the better.

937	-  Harmonization is imposing a term by the Terminological Authority.
938	It is an external characteristic of the document and it needs a
939	Database with the harmonized terms.
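
The two checks could be sketched in Python as follows. This is a
minimal sketch under an assumption not made in this document: the
terms have already been extracted from aligned texts as (source
term, target term) pairs, with the source term standing in for the
object named. The function names and the sample data are
hypothetical.

```python
from collections import defaultdict


def inconsistent_terms(term_pairs):
    """Consistency: source terms rendered by more than one target term."""
    renderings = defaultdict(set)
    for source, target in term_pairs:
        renderings[source].add(target)
    return {
        source: sorted(targets)
        for source, targets in renderings.items()
        if len(targets) > 1
    }


def unharmonized_terms(term_pairs, authority):
    """Harmonization: pairs that differ from the term imposed by the authority."""
    return sorted(
        {
            (source, target)
            for source, target in term_pairs
            if source in authority and authority[source] != target
        }
    )


# Hypothetical term pairs extracted from a set of aligned documents.
pairs = [("table", "mesa"), ("table", "tabla"), ("chair", "silla")]
# Hypothetical database of harmonized terms.
authority = {"table": "mesa"}

inconsistent = inconsistent_terms(pairs)
unharmonized = unharmonized_terms(pairs, authority)
```

The first function needs only the documents themselves; the second
needs the external database, which mirrors the distinction drawn
above.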

941	15.10 Multilingual Aligned Text Editor
942	An editor that shows at least two (aligned) texts, moves the texts
943	in sync, highlights the differences, etc.

945	15.11 Printing
946	A program that prints one or several Multilingual Aligned Texts side
947	by side. It could be the step following the Translation
948	Folder. Multilingual Aligned Texts (source and target) on paper
949	allow the translator to use traditional tools such as dictating.

951	16. Acknowledgments

953	This document makes heavy use of the documents cited in the text,
954	particularly the relevant RFCs and Internet-Drafts.

956	It also draws on the following:

958	-  Web Multilinguism. BOF meeting, Third International WWW Conference

960	-  Web Internationalization. BOF meeting, Fourth International WWW
961	Conference

963	-  Web Internationalization & Multilinguism. BOF meeting, Fifth
964	International WWW Conference

966	-  Internationalization Workshop. Fifth International WWW Conference

968	-  WInter mailing list

970	-  Informal talks/communications (probably the most fruitful)

972	The BOF meetings were organized by the author.

974	Martin Duerst made many suggestions to the position paper of the
975	author for the Internationalization Workshop during the Fifth
976	International WWW Conference. The present document is over 80%
977	based on the position paper. He commented on the Reference model and
978	I expect him to come back with further suggestions.

980	In such fluid circumstances, it is nearly impossible to attribute
981	credits. Nevertheless, the following people particularly come to mind:

983	Bert Bos
984	Martin Bryan
985	Martin Duerst
986	Albert Lunde
987	Larry Masinter
988	Gavin Nicol
989	Steven Pemberton
990	Christine Stark
991	François Yergeau
992	Faith Zack

994	The author tries to look for consensus and borrowed heavily from
995	many sources. On the other hand, he alone is responsible for
996	any shortcomings and the opinions expressed.

998	17. Bibliography

1000	[BRIAN] Martin Bryan, "Using HyTime to Link Translations", contribution
1001	to the WInter mailing list,
1002	http://www.crpht.lu/~carrasco/winter/hytime.html

1004	[CARRASCO-1] M.T. Carrasco Benitez, "On the multilingual normalization
1005	of the Web", Poster for the Third International WWW Conference,
1006	http://www.crpht.lu/~carrasco/winter/poster.html

1008	[CARRASCO-2] M.T. Carrasco Benitez, "Web Internationalization",
1009	Poster for the Fourth International WWW Conference,
1010	http://www.crpht.lu/~carrasco/winter/inter.html

1012	[CARRASCO-3] M.T. Carrasco Benitez, "WInter (Web Internationalization
1013	& Multilinguism)", Position paper for the Internationalization
1014	Workshop during the Fifth International WWW Conference,
1015	http://www.crpht.lu/~carrasco/winter/popa.html

1017	[CONNOLLY] "Character Set Considered Harmful",
1018	http://www.w3.org/hypertext/WWW/MarkUp/html-spec/charset-harmful.html

1020	[HTML 2.0] T. Berners-Lee, D. Connolly, "HTML 2.0", RFC 1866,
1021	http://www.ics.uci.edu/pub/ietf/html/rfc1866.txt

1023	[HTML 3.0] "HTML 3.0", expired Internet-Draft,
1024	http://www.hpl.hp.co.uk/people/dsr/html3/CoverPage.html

1026	[HTTP-1.1] R.T. Fielding, H. Frystyk Nielsen, and T. Berners-Lee,
1027	"Hypertext Transfer Protocol -- HTTP/1.1", Work in progress
1028	(draft-ietf-http-v11-spec-01.txt) MIT/LCS, January 1996.
1029	http://www.ics.uci.edu/pub/ietf/http/draft-ietf-http-v11-spec-01.html

1031	[I-HTML] F. Yergeau, G. Nicol, G. Adams, M. Duerst, "Internationalization
1032	of the Hypertext Markup Language", Work in progress,
1033	(draft-ietf-html-i18n-03.txt)
1034	http://www.alis.com:8085/ietf/html/draft-ietf-html-i18n.txt

1036	[ISO-8859-1] ISO 8859-1:1987. International Standard -- Information
1037	Processing -- 8-bit Single-Byte Coded Graphic Character Sets --
1038	Part 1: Latin Alphabet No. 1.

1040	[NICOL] G. T. Nicol, "The Multilingual WWW"
1041	http://www.ebt.com:8080/docs/multilingual-www.html

1043	[UNICODE] The Unicode Consortium, "The Unicode Standard -- Worldwide
1044	Character Encoding -- Version 1.0", Addison-Wesley, Volume 1, 1991,
1045	Volume 2, 1992. http://www.unicode.org

1047	[ZACK] F. Zack, "Serving Multilingual Online Documentation", Poster
1048	for the Fifth International WWW Conference

1050	{This list will be completed.}

1052	18. Author's Address

1054	Manuel Tomas CARRASCO BENITEZ
1055	carrasco@innet.lu
1056	http://www.crpht.lu/~carrasco/winter