< draft-ietf-acap-mlsf-00.txt   draft-ietf-acap-mlsf-01.txt >
Network Working Group C. Newman Network Working Group C. Newman
Internet Draft: Multi-Lingual String Format Innosoft Internet Draft: Multi-Lingual String Format Innosoft
Document: draft-ietf-acap-mlsf-00.txt May 1997 Document: draft-ietf-acap-mlsf-01.txt June 1997
Expires in six months Expires in six months
Multi-Lingual String Format (MLSF) Multi-Lingual String Format (MLSF)
Status of this memo Status of this memo
This document is an Internet Draft. Internet Drafts are working This document is an Internet Draft. Internet Drafts are working
documents of the Internet Engineering Task Force (IETF), its Areas, documents of the Internet Engineering Task Force (IETF), its Areas,
and its Working Groups. Note that other groups may also distribute and its Working Groups. Note that other groups may also distribute
working documents as Internet Drafts. working documents as Internet Drafts.
skipping to change at page 1, line 36 skipping to change at page 1, line 36
munnari.oz.au. munnari.oz.au.
A revised version of this draft document will be submitted to the A revised version of this draft document will be submitted to the
RFC editor as a Proposed Standard for the Internet Community. RFC editor as a Proposed Standard for the Internet Community.
Discussion and suggestions for improvement are requested. This Discussion and suggestions for improvement are requested. This
document will expire six months after publication. Distribution of document will expire six months after publication. Distribution of
this draft is unlimited. this draft is unlimited.
Abstract Abstract
While UTF-8 [UTF-8] solves most internationalization (I18N) The IAB charset workshop [IAB-CHARSET] concluded that for human
problems, it fails to solve multilingualization problems (M17N) readable text there should always be a way to specify the natural
problems. The two basic problems with UTF-8 are that CJK language. Many protocols are designed with an attribute-value
unification fails to recognize glyph style differences between model (including RFC 822, HTTP, LDAP, SNMP, DHCP, and ACAP) which
Chinese, Japanese and Korean and that it is impossible to read stores many small human readable text strings. The primary
UTF-8 text to a blind person without knowing the language. function of an attribute-value model is to simplify both
extensibility and searchability. A solution is needed to provide
language tags in these small human readable text strings, which
does not interfere with these primary functions.
Encoding language tagging in the coded character set itself can This specification defines MLSF (Multi-Lingual String Format) which
unnecessarily complicate processing which doesn't need language applies another layer of encoding on top of UTF-8 [UTF-8] to permit
tags. Encoding the language tagging at the application protocol the addition of language tags anywhere within a text string. In
level will add unnecessary complexity to every application protocol addition, it defines an alternate form which can be used to include
which needs multi-lingual support. In addition, such higher level alternative representations of the same text in different character
language support may fail to deal with mixed language strings and sets. MLSF has the property that UTF-8 is a proper subset of MLSF.
strings which have alternate representations in different This preserves the searchability requirement of the attribute-value
languages. model.
This specification uses unused octet sequences in UTF-8 as a Appendix F of this document includes a brief discussion of the
framework to build a new encoding called MLSF (Multi-Lingual String background behind MLSF and why some other potential solutions were
Format) which supports mixed language strings and alternative rejected for this purpose.
language strings. The goal is to make language tags easy to strip
when unnecessary, easy to support when necessary, and to preserve
the good searching characteristics of UTF-8 as much as possible.
1. Conventions used in this document 1. Conventions used in this document
The key words "MUST", "MUST NOT", "SHOULD", "SHOULD NOT", and "MAY" The key words "MUST", "MUST NOT", "SHOULD", "SHOULD NOT", and "MAY"
in this document are to be interpreted as defined in "Key words for in this document are to be interpreted as defined in "Key words for
use in RFCs to Indicate Requirement Levels" [KEYWORDS]. use in RFCs to Indicate Requirement Levels" [KEYWORDS].
2. MLSF simple form 2. MLSF simple form
MLSF uses "Tags for the Identification of Languages" [LANG-TAGS] as MLSF uses "Tags for the Identification of Languages" [LANG-TAGS] as
the basis for language identification. the basis for language identification.
Language tags are encoded by mapping them to upper-case, then Language tags are encoded by mapping them to upper-case, then
adding hexidecimal A0 to each octet. The result is broken up into adding hexadecimal A0 to each octet. The result is broken up into
groups of five octets followed by a final group of five or fewer groups of five octets followed by a final group of five or fewer
octets. Each group is prefixed by a UTF-8-style length count with octets. Each group is prefixed by a UTF-8-style length count with
the low bits set to 0. See Appendix D for sample source code to the low bits set to 0. See Appendix D for sample source code to
perform this conversion. perform this conversion.
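The encoding steps above can be illustrated with a short sketch. This is not the Appendix D code. It assumes one reading of "UTF-8-style length count with the low bits set to 0": the prefix octet is the UTF-8 lead octet for a sequence of (group length + 1) octets with its payload bits cleared. The names mlsf_prefix and mlsf_encode_tag are hypothetical.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* ASSUMPTION: prefix octet for a group of n tag octets (n = 1..5) is the
 * UTF-8 lead octet of an (n + 1)-octet sequence with payload bits zero.
 * This table is inferred from the prose, not taken from Appendix D. */
static const unsigned char mlsf_prefix[6] = {
    0, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC
};

/* Encode the NUL-terminated language tag `tag` into `out`, which has
 * room for `outlen` octets.  Returns the number of octets written, or
 * 0 if the tag is empty or the buffer is too small. */
size_t mlsf_encode_tag(const char *tag, unsigned char *out, size_t outlen)
{
    size_t len = strlen(tag);
    size_t groups = (len + 4) / 5;   /* groups of at most five octets */
    size_t pos = 0, i = 0;

    if (len == 0 || len + groups > outlen)
        return 0;
    while (i < len) {
        size_t n = (len - i < 5) ? len - i : 5;
        size_t j;

        out[pos++] = mlsf_prefix[n];
        for (j = 0; j < n; j++, i++) {
            /* map to upper-case, then add hexadecimal A0 */
            out[pos++] = (unsigned char)
                (toupper((unsigned char)tag[i]) + 0xA0);
        }
    }
    return pos;
}
```

Under this reading, the tag "en" would encode as the three octets E0 E5 EE, and a nine-octet tag would produce one group of five followed by one group of four.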
MLSF simple form is UTF-8 with embedded MLSF language tags. An MLSF simple form is defined by the MLSF-SIMPLE rule in section 7.
important observation is that a UTF-8 interpreter which silently A quoted version of MLSF simple form is defined by the MLSF-
ignores illegal characters will successfully process MLSF simple SIMPLE-QUOTED rule.
form strings. MLSF simple form is defined by the MLSF-SIMPLE rule
in section 7. A quoted version of MLSF simple form is defined by Note that MLSF is not compatible with UTF-8. A program which uses
the MLSF-SIMPLE-QUOTED rule. MLSF MUST downconvert it to UTF-8 prior to using it in a context
where UTF-8 is required. Sample code for this down conversion is
included in Appendix B.
3. MLSF alternative form 3. MLSF alternative form
A MLSF alternative form string may contain alternative A MLSF alternative form string may contain alternative
representations of the same text in different primary languages. representations of the same text in different primary languages.
The octet with hexidecimal representation of FE is used to The octet with hexadecimal representation of FE is used to
introduce a new alternative. This MUST be followed by a MLSF introduce a new alternative. This MUST be followed by a MLSF
language tag for the primary language of the alternative. language tag for the primary language of the alternative.
The component of the MLSF string prior to the first FE octet is The component of the MLSF string prior to the first FE octet is
considered the "preferred" representation for the string. This is considered the "preferred" representation for the string. This is
the version which will be displayed by MLSF clients which choose the version which will be displayed by MLSF clients which choose
not to support alternative representations. The preferred not to support alternative representations. The preferred
representation MAY be prefixed by a MLSF language tag. representation MAY be prefixed by a MLSF language tag.
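As a minimal sketch of the rule above, a client that does not support alternative representations could locate the preferred representation by scanning for the first FE octet. The names MLSF_ALTERNATE and mlsf_preferred_len are hypothetical, and this sketch does not strip language tags from the result; it is not the Appendix B code.

```c
#include <stddef.h>

#define MLSF_ALTERNATE 0xFEU  /* octet that introduces a new alternative */

/* Return the length in octets of the preferred representation: the
 * part of the NUL-terminated MLSF string before the first FE octet.
 * Any language tags inside that part are left in place. */
size_t mlsf_preferred_len(const unsigned char *str)
{
    size_t n = 0;

    while (str[n] != '\0' && str[n] != MLSF_ALTERNATE)
        n++;
    return n;
}
```

A string with no FE octet is returned whole, so a plain UTF-8 string passes through unchanged.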
MLSF alternate form is defined by the MLSF-ALT rule in section 7. MLSF alternate form is defined by the MLSF-ALT rule in section 7.
A quoted version of MLSF alternate form is defined by the A quoted version of MLSF alternate form is defined by the
MLSF-ALT-QUOTED rule. MLSF-ALT-QUOTED rule.
4. Minimal Support: downconverting MLSF to UTF-8 Note that MLSF alternate form is not compatible with UTF-8. A
program which uses MLSF MUST downconvert it to UTF-8 prior to using
Minimal support for MLSF requires the ability to downconvert MLSF it in a context where UTF-8 is required. Sample code for this down
to UTF-8. This is a simple procedure which selects the preferred conversion is included in Appendix B.
alternative and strips all language tags. Sample code is included
in Appendix B. All UTF-8 strings which do not contain a 0 octet
are also MLSF strings.
5. MLSF MIME character sets 4. MLSF MIME character sets
The character set label "XXXX-simple" has been registered to The character set label "XXXX-simple" will be registered to
indicate the use of MLSF simple form. The character set label indicate the use of MLSF simple form. The character set label
"XXXX-alt" has been registered to indicate the use of MLSF "XXXX-alt" will be registered to indicate the use of MLSF alternate
alternate form. form.
MLSF may be used in conjunction with MIME header [MIME-HDR] MLSF may be used in conjunction with MIME header [MIME-HDR]
encoding to permit language tagging and alternative representations encoding to permit language tagging and alternative representations
in header fields. in header fields. A work in progress [MIME-LANG] will propose a
mechanism for language tagging in headers which is not dependent on
the use of UTF-8.
For single language MIME body parts, the UTF-8 character set with For single language MIME body parts, the UTF-8 character set with
an appropriate Content-Language [LANG-TAGS] header SHOULD be used an appropriate Content-Language [LANG-TAGS] header SHOULD be used
instead of MLSF. instead of MLSF. Text/enriched [ENRICHED] or HTML with language
tags [HTML-I18N] are preferred to using MLSF for MIME bodies when
possible.
6. Security Considerations 5. Security Considerations
Multi-Lingual String Format is not believed to have any security Multi-Lingual String Format is not believed to have any security
considerations beyond those for simple US-ASCII strings. In considerations beyond those for simple US-ASCII strings. In
particular, unfiltered display of certain US-ASCII control particular, unfiltered display of certain US-ASCII control
characters by a terminal emulator may result in modifying the characters by a terminal emulator may result in modifying the
behavior of the terminal emulator (e.g. by redefining function behavior of the terminal emulator (e.g. by redefining function
keys) such that security can be breached. Programs which display keys) such that security can be breached. Programs which display
text to a potentially insecure terminal emulator channel are text to a potentially insecure terminal emulator channel are
encouraged to remove control characters to avoid these problems. encouraged to remove control characters to avoid these problems.
7. Formal Grammar 6. Formal Grammar
This section defines the formal grammar for MLSF using Augmented This section defines the formal grammar for MLSF using Augmented
BNF [ABNF] notation. BNF [ABNF] notation.
MLSF-ALT = [[MLSF-LANG-TAG] MLSF-COMPONENT MLSF-ALT = [[MLSF-LANG-TAG] MLSF-COMPONENT
*(MLSF-ALTERNATE MLSF-COMPONENT)] *(MLSF-ALTERNATE MLSF-COMPONENT)]
MLSF-ALT-QUOTED = <"> [[MLSF-LANG-TAG] MLSF-COMPONENT-Q MLSF-ALT-QUOTED = <"> [[MLSF-LANG-TAG] MLSF-COMPONENT-Q
*(MLSF-ALTERNATE MLSF-COMPONENT-Q)] <"> *(MLSF-ALTERNATE MLSF-COMPONENT-Q)] <">
skipping to change at page 4, line 49 skipping to change at page 4, line 49
MLSF-SIMPLE = [[MLSF-LANG-TAG] MLSF-COMPONENT] MLSF-SIMPLE = [[MLSF-LANG-TAG] MLSF-COMPONENT]
MLSF-SIMPLE-QUOTED = <"> [[MLSF-LANG-TAG] MLSF-COMPONENT-Q] <"> MLSF-SIMPLE-QUOTED = <"> [[MLSF-LANG-TAG] MLSF-COMPONENT-Q] <">
QUOTED = "\" QUOTED-SPECIAL QUOTED = "\" QUOTED-SPECIAL
QUOTED-SPECIAL = "\" / <"> QUOTED-SPECIAL = "\" / <">
US-ASCII-SAFE = %x01..09 / %x0B..0C / %x0E..21 US-ASCII-SAFE = %x01..09 / %x0B..0C / %x0E..21
/ %x23..2E / %x30..7F / %x23..5B / %x5D..7F
;; US-ASCII except QUOTED-SPECIALs, CR, LF, NUL ;; US-ASCII except QUOTED-SPECIALs, CR, LF, NUL
UTF8-NON-NUL = UTF8-SAFE / CR / LF / QUOTED-SPECIAL UTF8-NON-NUL = UTF8-SAFE / CR / LF / QUOTED-SPECIAL
UTF8-QUOTED = UTF8-SAFE / QUOTED UTF8-QUOTED = UTF8-SAFE / QUOTED
UTF8-SAFE = US-ASCII-SAFE / UTF8-1 / UTF8-2 / UTF8-3 UTF8-SAFE = US-ASCII-SAFE / UTF8-1 / UTF8-2 / UTF8-3
/ UTF8-4 / UTF8-5 / UTF8-4 / UTF8-5
UTF8-CONT = %x80..BF UTF8-CONT = %x80..BF
UTF8-1 = %xC0..DF UTF8-CONT UTF8-1 = %xC0..DF UTF8-CONT
UTF8-2 = %xE0..EF 2UTF8-CONT UTF8-2 = %xE0..EF 2UTF8-CONT
UTF8-3 = %xF0..F7 3UTF8-CONT UTF8-3 = %xF0..F7 3UTF8-CONT
UTF8-4 = %xF8..FB 4UTF8-CONT UTF8-4 = %xF8..FB 4UTF8-CONT
UTF8-5 = %xFC..FD 5UTF8-CONT UTF8-5 = %xFC..FD 5UTF8-CONT
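The quoted rules above can be illustrated with a small sketch that wraps a string in double quotes and prefixes each QUOTED-SPECIAL with a backslash. It rejects CR and LF, which the grammar excludes from UTF8-QUOTED. The function name mlsf_quote and its return convention are hypothetical, and octets above 7F (including language tags) are copied through unchanged without validation.

```c
#include <stddef.h>

/* Copy `in` to `out` surrounded by double quotes, prefixing each
 * QUOTED-SPECIAL (backslash or double quote) with a backslash.
 * Returns the number of octets written, not counting the terminating
 * NUL, or 0 if the input contains CR or LF or the buffer is too small. */
size_t mlsf_quote(const unsigned char *in, unsigned char *out, size_t outlen)
{
    size_t pos = 0;

    if (outlen < 3)                 /* room for the two quotes and NUL */
        return 0;
    out[pos++] = '"';
    for (; *in != '\0'; in++) {
        if (*in == '\r' || *in == '\n')
            return 0;
        /* conservative check: escape + octet + closing quote + NUL */
        if (pos + 4 > outlen)
            return 0;
        if (*in == '\\' || *in == '"')
            out[pos++] = '\\';
        out[pos++] = *in;
    }
    out[pos++] = '"';
    out[pos] = '\0';
    return pos;
}
```

The bounds check is deliberately conservative, so a buffer that would fit exactly may still be rejected; a caller can size the buffer at twice the input length plus three octets to be safe.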
8. References 7. References
[ABNF] Crocker, D., "Augmented BNF for Syntax Specifications: [ABNF] Crocker, D., "Augmented BNF for Syntax Specifications:
ABNF", Work in progress: draft-ietf-drums-abnf-xx.txt ABNF", Work in progress: draft-ietf-drums-abnf-xx.txt
[ENRICHED] Resnick, Walker, "The text/enriched MIME Content-type",
RFC 1896, Qualcomm, InterCon, February 1996.
<ftp://ds.internic.net/rfc/rfc1896.txt>
[HTML-I18N] Yergeau, Nicol, Adams, Duerst, "Internationalization of
the Hypertext Markup Language", RFC 2070, Alis Technologies,
Electronic Book Technologies, Spyglass, University of Zurich,
January 1997.
<ftp://ds.internic.net/rfc/rfc2070.txt>
[IAB-CHARSET] Weider, Preston, Simonsen, Alvestrand, Atkinson,
Crispin, Svanberg, "The Report of the IAB Character Set Workshop
held 29 February - 1 March, 1996", RFC 2130, April 1997.
<ftp://ds.internic.net/rfc/rfc2130.txt>
[IMAP4] Crispin, "Internet Message Access Protocol - Version
4rev1", RFC 2060, University of Washington, December 1996.
<ftp://ds.internic.net/rfc/rfc2060.txt>
[KEYWORDS] Bradner, "Key words for use in RFCs to Indicate [KEYWORDS] Bradner, "Key words for use in RFCs to Indicate
Requirement Levels", RFC 2119, Harvard University, March 1997. Requirement Levels", RFC 2119, Harvard University, March 1997.
<ftp://ds.internic.net/rfc/rfc2119.txt> <ftp://ds.internic.net/rfc/rfc2119.txt>
[LANG-TAGS] Alvestrand, H., "Tags for the Identification of [LANG-TAGS] Alvestrand, H., "Tags for the Identification of
Languages", RFC 1766. Languages", RFC 1766.
<ftp://ds.internic.net/rfc/rfc1766.txt> <ftp://ds.internic.net/rfc/rfc1766.txt>
skipping to change at page 6, line 5 skipping to change at page 6, line 27
2047, University of Tennessee, November 1996. 2047, University of Tennessee, November 1996.
<ftp://ds.internic.net/rfc/rfc2047.txt> <ftp://ds.internic.net/rfc/rfc2047.txt>
[MIME-IMB] Freed, Borenstein, "Multipurpose Internet Mail [MIME-IMB] Freed, Borenstein, "Multipurpose Internet Mail
Extensions (MIME) Part One: Format of Internet Message Bodies", RFC Extensions (MIME) Part One: Format of Internet Message Bodies", RFC
2045, Innosoft, First Virtual, November 1996. 2045, Innosoft, First Virtual, November 1996.
<ftp://ds.internic.net/rfc/rfc2045.txt> <ftp://ds.internic.net/rfc/rfc2045.txt>
[MIME-LANG] Freed, Moore, "MIME Parameter Value and Encoded Words:
Character Sets, Language, and Continuations", work in progress,
March 1997.
[UTF-8] Yergeau, F. "UTF-8, a transformation format of Unicode and [UTF-8] Yergeau, F. "UTF-8, a transformation format of Unicode and
ISO 10646", RFC 2044, Alis Technologies, October 1996. ISO 10646", RFC 2044, Alis Technologies, October 1996.
<ftp://ds.internic.net/rfc/rfc2044.txt> <ftp://ds.internic.net/rfc/rfc2044.txt>
9. Acknowledgements 8. Acknowledgements
Special thanks to Mark Crispin for the idea of using unused UTF-8 Special thanks to Mark Crispin for the idea of using unused UTF-8
codes for this purpose. Thanks are also due to participants of codes for this purpose. Thanks are also due to participants of
the ACAP WG mailing list who helped review this proposal. the ACAP WG mailing list who helped review this proposal.
10. Author's Address 9. Author's Address
Chris Newman Chris Newman
Innosoft International, Inc. Innosoft International, Inc.
1050 East Garvey Ave. South 1050 East Garvey Ave. South
West Covina, CA 91790 USA West Covina, CA 91790 USA
Email: chris.newman@innosoft.com Email: chris.newman@innosoft.com
Appendix A. Client advice Appendix A. Client advice
skipping to change at line 504 skipping to change at page 13, line 38
} }
/* skip to next MLSF component */ /* skip to next MLSF component */
while (*str != '\0' && *str++ != 0xFEU) while (*str != '\0' && *str++ != 0xFEU)
; ;
} while (*str != '\0'); } while (*str != '\0');
} }
return (best); return (best);
} }
Appendix F. Background and Alternate Solutions
MLSF was designed to deal with language tagging in the context of
the ACAP protocol, but is believed to be useful in other contexts.
Specific scenarios cited during discussion were human names in
address books, system administrator alert error messages, and error
messages which include identifiers potentially in a different
language from the client's preferred error message language. Since
ACAP is an arbitrary attribute-value protocol, it is impossible to
imagine all possible scenarios in advance, so a general purpose
mechanism was needed.
There have been several attempts to solve language tagging in
attribute-value protocols. RFC 822 poses a particularly
troublesome scenario, since headers must be 7-bit. The MIME
solution to label character sets [MIME-HDR] and languages [MIME-
LANG] in headers is thus a necessary evil. The result of this is
to make header searching services such as those provided by IMAP
[IMAP4] massively more complex. If 8-bit headers were permitted a
solution like MLSF would have been far simpler and more efficient.
Another approach taken is demonstrated by the current vCard,
iCalendar, and LDAPv3 proposals (all works in progress). These
proposals overload the attribute namespace to provide language
tagging and create a concept roughly described as attributes of
the attribute. The result of this is that clients have to deal
with a multiple attribute response to a query where each attribute
may have multiple values. The additional complexity this adds to
client processing was deemed unacceptable for ACAP where client
simplicity was an important design goal.
Another possible approach is the use of a markup language such as
text/enriched [ENRICHED]. While this is certainly a suitable
language tagging solution for large text objects such as MIME
bodies, it is unsuitable for the attribute-value model where
searching is a primary function.
 End of changes. 22 change blocks. 
48 lines changed or deleted 78 lines changed or added
