| < draft-ietf-acap-mlsf-00.txt | draft-ietf-acap-mlsf-01.txt > | |||
|---|---|---|---|---|
| Network Working Group C. Newman | Network Working Group C. Newman | |||
| Internet Draft: Multi-Lingual String Format Innosoft | Internet Draft: Multi-Lingual String Format Innosoft | |||
| Document: draft-ietf-acap-mlsf-00.txt May 1997 | Document: draft-ietf-acap-mlsf-01.txt June 1997 | |||
| Expires in six months | Expires in six months | |||
| Multi-Lingual String Format (MLSF) | Multi-Lingual String Format (MLSF) | |||
| Status of this memo | Status of this memo | |||
| This document is an Internet Draft. Internet Drafts are working | This document is an Internet Draft. Internet Drafts are working | |||
| documents of the Internet Engineering Task Force (IETF), its Areas, | documents of the Internet Engineering Task Force (IETF), its Areas, | |||
| and its Working Groups. Note that other groups may also distribute | and its Working Groups. Note that other groups may also distribute | |||
| working documents as Internet Drafts. | working documents as Internet Drafts. | |||
| skipping to change at page 1, line 36 ¶ | skipping to change at page 1, line 36 ¶ | |||
| munnari.oz.au. | munnari.oz.au. | |||
| A revised version of this draft document will be submitted to the | A revised version of this draft document will be submitted to the | |||
| RFC editor as a Proposed Standard for the Internet Community. | RFC editor as a Proposed Standard for the Internet Community. | |||
| Discussion and suggestions for improvement are requested. This | Discussion and suggestions for improvement are requested. This | |||
| document will expire six months after publication. Distribution of | document will expire six months after publication. Distribution of | |||
| this draft is unlimited. | this draft is unlimited. | |||
| Abstract | Abstract | |||
| While UTF-8 [UTF-8] solves most internationalization (I18N) | The IAB charset workshop [IAB-CHARSET] concluded that for human | |||
| problems, it fails to solve multilingualization problems (M17N) | readable text there should always be a way to specify the natural | |||
| problems. The two basic problems with UTF-8 are that CJK | language. Many protocols are designed with an attribute-value | |||
| unification fails to recognize glyph style differences between | model (including RFC 822, HTTP, LDAP, SNMP, DHCP, and ACAP) which | |||
| Chinese, Japanese and Korean and that it is impossible to read | stores many small human readable text strings. The primary | |||
| UTF-8 text to a blind person without knowing the language. | function of an attribute-value model is to simplify both | |||
| extensibility and searchability. A solution is needed to provide | ||||
| language tags in these small human readable text strings, which | ||||
| does not interfere with these primary functions. | ||||
| Encoding language tagging in the coded character set itself can | This specification defines MLSF (Multi-Lingual String Format) which | |||
| unnecessarily complicate processing which doesn't need language | applies another layer of encoding on top of UTF-8 [UTF-8] to permit | |||
| tags. Encoding the language tagging at the application protocol | the addition of language tags anywhere within a text string. In | |||
| level will add unnecessary complexity to every application protocol | addition, it defines an alternate form which can be used to include | |||
| which needs multi-lingual support. In addition, such higher level | alternative representations of the same text in different character | |||
| language support may fail to deal with mixed language strings and | sets. MLSF has the property that UTF-8 is a proper subset of MLSF. | |||
| strings which have alternate representations in different | This preserves the searchability requirement of the attribute-value | |||
| languages. | model. | |||
| This specification uses unused octet sequences in UTF-8 as a | Appendix F of this document includes a brief discussion of the | |||
| framework to build a new encoding called MLSF (Multi-Lingual String | background behind MLSF and why some other potential solutions were | |||
| Format) which supports mixed language strings and alternative | rejected for this purpose. | |||
| language strings. The goal is to make language tags easy to strip | ||||
| when unnecessary, easy to support when necessary, and to preserve | ||||
| the good searching characteristics of UTF-8 as much as possible. | ||||
| 1. Conventions used in this document | 1. Conventions used in this document | |||
| The key words "MUST", "MUST NOT", "SHOULD", "SHOULD NOT", and "MAY" | The key words "MUST", "MUST NOT", "SHOULD", "SHOULD NOT", and "MAY" | |||
| in this document are to be interpreted as defined in "Key words for | in this document are to be interpreted as defined in "Key words for | |||
| use in RFCs to Indicate Requirement Levels" [KEYWORDS]. | use in RFCs to Indicate Requirement Levels" [KEYWORDS]. | |||
| 2. MLSF simple form | 2. MLSF simple form | |||
| MLSF uses "Tags for the Identification of Languages" [LANG-TAGS] as | MLSF uses "Tags for the Identification of Languages" [LANG-TAGS] as | |||
| the basis for language identification. | the basis for language identification. | |||
| Language tags are encoded by mapping them to upper-case, then | Language tags are encoded by mapping them to upper-case, then | |||
| adding hexidecimal A0 to each octet. The result is broken up into | adding hexadecimal A0 to each octet. The result is broken up into | |||
| groups of five octets followed by a final group of five or fewer | groups of five octets followed by a final group of five or fewer | |||
| octets. Each group is prefixed by a UTF-8-style length count with | octets. Each group is prefixed by a UTF-8-style length count with | |||
| the low bits set to 0. See Appendix D for sample source code to | the low bits set to 0. See Appendix D for sample source code to | |||
| perform this conversion. | perform this conversion. | |||
| MLSF simple form is UTF-8 with embedded MLSF language tags. An | MLSF simple form is defined by the MLSF-SIMPLE rule in section 7. | |||
| important observation is that a UTF-8 interpreter which silently | A quoted version of MLSF simple form is defined by the MLSF- | |||
| ignores illegal characters will successfully process MLSF simple | SIMPLE-QUOTED rule. | |||
| form strings. MLSF simple form is defined by the MLSF-SIMPLE rule | ||||
| in section 7. A quoted version of MLSF simple form is defined by | Note that MLSF is not compatible with UTF-8. A program which uses | |||
| the MLSF-SIMPLE-QUOTED rule. | MLSF MUST downconvert it to UTF-8 prior to using it in a context | |||
| where UTF-8 is required. Sample code for this down conversion is | ||||
| included in Appendix B. | ||||
| 3. MLSF alternative form | 3. MLSF alternative form | |||
| A MLSF alternative form string may contain alternative | A MLSF alternative form string may contain alternative | |||
| representations of the same text in different primary languages. | representations of the same text in different primary languages. | |||
| The octet with hexidecimal representation of FE is used to | The octet with hexadecimal representation of FE is used to | |||
| introduce a new alternative. This MUST be followed by a MLSF | introduce a new alternative. This MUST be followed by a MLSF | |||
| language tag for the primary language of the alternative. | language tag for the primary language of the alternative. | |||
| The component of the MLSF string prior to the first FE octet is | The component of the MLSF string prior to the first FE octet is | |||
| considered the "preferred" representation for the string. This is | considered the "preferred" representation for the string. This is | |||
| the version which will be displayed by MLSF clients which choose | the version which will be displayed by MLSF clients which choose | |||
| not to support alternative representations. The preferred | not to support alternative representations. The preferred | |||
| representation MAY be prefixed by a MLSF language tag. | representation MAY be prefixed by a MLSF language tag. | |||
| MLSF alternate form is defined by the MLSF-ALT rule in section 7. | MLSF alternate form is defined by the MLSF-ALT rule in section 7. | |||
| A quoted version of MLSF alternate form is defined by the | A quoted version of MLSF alternate form is defined by the | |||
| MLSF-ALT-QUOTED rule. | MLSF-ALT-QUOTED rule. | |||
| 4. Minimal Support: downconverting MLSF to UTF-8 | Note that MLSF alternate form is not compatible with UTF-8. A | |||
| program which uses MLSF MUST downconvert it to UTF-8 prior to using | ||||
| Minimal support for MLSF requires the ability to downconvert MLSF | it in a context where UTF-8 is required. Sample code for this down | |||
| to UTF-8. This is a simple procedure which selects the preferred | conversion is included in Appendix B. | |||
| alternative and strips all language tags. Sample code is included | ||||
| in Appendix B. All UTF-8 strings which do not contain a 0 octet | ||||
| are also MLSF strings. | ||||
| 5. MLSF MIME character sets | 4. MLSF MIME character sets | |||
| The character set label "XXXX-simple" has been registered to | The character set label "XXXX-simple" will be registered to | |||
| indicate the use of MLSF simple form. The character set label | indicate the use of MLSF simple form. The character set label | |||
| "XXXX-alt" has been registered to indicate the use of MLSF | "XXXX-alt" will be registered to indicate the use of MLSF alternate | |||
| alternate form. | form. | |||
| MLSF may be used in conjunction with MIME header [MIME-HDR] | MLSF may be used in conjunction with MIME header [MIME-HDR] | |||
| encoding to permit language tagging and alternative representations | encoding to permit language tagging and alternative representations | |||
| in header fields. | in header fields. A work in progress [MIME-LANG] will propose a | |||
| mechanism for language tagging in headers which is not dependent on | ||||
| the use of UTF-8. | ||||
| For single language MIME body parts, the UTF-8 character set with | For single language MIME body parts, the UTF-8 character set with | |||
| an appropriate Content-Language [LANG-TAGS] header SHOULD be used | an appropriate Content-Language [LANG-TAGS] header SHOULD be used | |||
| instead of MLSF. | instead of MLSF. Text/enriched [ENRICHED] or HTML with language | |||
| tags [HTML-I18N] are preferred to using MLSF for MIME bodies when | ||||
| possible. | ||||
| 6. Security Considerations | 5. Security Considerations | |||
| Multi-Lingual String Format is not believed to have any security | Multi-Lingual String Format is not believed to have any security | |||
| considerations beyond those for simple US-ASCII strings. In | considerations beyond those for simple US-ASCII strings. In | |||
| particular, unfiltered display of certain US-ASCII control | particular, unfiltered display of certain US-ASCII control | |||
| characters by a terminal emulator may result in modifying the | characters by a terminal emulator may result in modifying the | |||
| behavior of the terminal emulator (e.g. by redefining function | behavior of the terminal emulator (e.g. by redefining function | |||
| keys) such that security can be breached. Programs which display | keys) such that security can be breached. Programs which display | |||
| text to a potentially insecure terminal emulator channel are | text to a potentially insecure terminal emulator channel are | |||
| encouraged to remove control characters to avoid these problems. | encouraged to remove control characters to avoid these problems. | |||
| 7. Formal Grammar | 6. Formal Grammar | |||
| This section defines the formal grammar for MLSF using Augmented | This section defines the formal grammar for MLSF using Augmented | |||
| BNF [ABNF] notation. | BNF [ABNF] notation. | |||
| MLSF-ALT = [[MLSF-LANG-TAG] MLSF-COMPONENT | MLSF-ALT = [[MLSF-LANG-TAG] MLSF-COMPONENT | |||
| *(MLSF-ALTERNATE MLSF-COMPONENT)] | *(MLSF-ALTERNATE MLSF-COMPONENT)] | |||
| MLSF-ALT-QUOTED = <"> [[MLSF-LANG-TAG] MLSF-COMPONENT-Q | MLSF-ALT-QUOTED = <"> [[MLSF-LANG-TAG] MLSF-COMPONENT-Q | |||
| *(MLSF-ALTERNATE MLSF-COMPONENT-Q)] <"> | *(MLSF-ALTERNATE MLSF-COMPONENT-Q)] <"> | |||
| skipping to change at page 4, line 49 ¶ | skipping to change at page 4, line 49 ¶ | |||
| MLSF-SIMPLE = [[MLSF-LANG-TAG] MLSF-COMPONENT] | MLSF-SIMPLE = [[MLSF-LANG-TAG] MLSF-COMPONENT] | |||
| MLSF-SIMPLE-QUOTED = <"> [[MLSF-LANG-TAG] MLSF-COMPONENT-Q] <"> | MLSF-SIMPLE-QUOTED = <"> [[MLSF-LANG-TAG] MLSF-COMPONENT-Q] <"> | |||
| QUOTED = "\" QUOTED-SPECIAL | QUOTED = "\" QUOTED-SPECIAL | |||
| QUOTED-SPECIAL = "\" / <"> | QUOTED-SPECIAL = "\" / <"> | |||
| US-ASCII-SAFE = %x01..09 / %x0B..0C / %x0E..21 | US-ASCII-SAFE = %x01..09 / %x0B..0C / %x0E..21 | |||
| / %x23..2E / %x30..7F | / %x23..5B / %x5D..7F | |||
| ;; US-ASCII except QUOTED-SPECIALs, CR, LF, NUL | ;; US-ASCII except QUOTED-SPECIALs, CR, LF, NUL | |||
| UTF8-NON-NUL = UTF8-SAFE / CR / LF / QUOTED-SPECIAL | UTF8-NON-NUL = UTF8-SAFE / CR / LF / QUOTED-SPECIAL | |||
| UTF8-QUOTED = UTF8-SAFE / QUOTED | UTF8-QUOTED = UTF8-SAFE / QUOTED | |||
| UTF8-SAFE = US-ASCII-SAFE / UTF8-1 / UTF8-2 / UTF8-3 | UTF8-SAFE = US-ASCII-SAFE / UTF8-1 / UTF8-2 / UTF8-3 | |||
| / UTF8-4 / UTF8-5 | / UTF8-4 / UTF8-5 | |||
| UTF8-CONT = %x80..BF | UTF8-CONT = %x80..BF | |||
| UTF8-1 = %xC0..DF UTF8-CONT | UTF8-1 = %xC0..DF UTF8-CONT | |||
| UTF8-2 = %xE0..EF 2UTF8-CONT | UTF8-2 = %xE0..EF 2UTF8-CONT | |||
| UTF8-3 = %xF0..F7 3UTF8-CONT | UTF8-3 = %xF0..F7 3UTF8-CONT | |||
| UTF8-4 = %xF8..FB 4UTF8-CONT | UTF8-4 = %xF8..FB 4UTF8-CONT | |||
| UTF8-5 = %xFC..FD 5UTF8-CONT | UTF8-5 = %xFC..FD 5UTF8-CONT | |||
| 8. References | 7. References | |||
| [ABNF] Crocker, D., "Augmented BNF for Syntax Specifications: | [ABNF] Crocker, D., "Augmented BNF for Syntax Specifications: | |||
| ABNF", Work in progress: draft-ietf-drums-abnf-xx.txt | ABNF", Work in progress: draft-ietf-drums-abnf-xx.txt | |||
| [ENRICHED] Resnick, Walker, "The text/enriched MIME Content-type", | ||||
| RFC 1896, Qualcomm, InterCon, February 1996. | ||||
| <ftp://ds.internic.net/rfc/rfc1896.txt> | ||||
| [HTML-I18N] Yergeau, Nicol, Adams, Duerst, "Internationalization of | ||||
| the Hypertext Markup Language", RFC 2070, Alis Technologies, | ||||
| Electronic Book Technologies, Spyglass, University of Zurich, | ||||
| January 1997. | ||||
| <ftp://ds.internic.net/rfc/rfc2070.txt> | ||||
| [IAB-CHARSET] Weider, Preston, Simonsen, Alvestrand, Atkinson, | ||||
| Crispin, Svanberg, "The Report of the IAB Character Set Workshop | ||||
| held 29 February - 1 March, 1996", RFC 2130, April 1997. | ||||
| <ftp://ds.internic.net/rfc/rfc2130.txt> | ||||
| [IMAP4] Crispin, "Internet Message Access Protocol - Version | ||||
| 4rev1", RFC 2060, University of Washington, December 1996. | ||||
| <ftp://ds.internic.net/rfc/rfc2060.txt> | ||||
| [KEYWORDS] Bradner, "Key words for use in RFCs to Indicate | [KEYWORDS] Bradner, "Key words for use in RFCs to Indicate | |||
| Requirement Levels", RFC 2119, Harvard University, March 1997. | Requirement Levels", RFC 2119, Harvard University, March 1997. | |||
| <ftp://ds.internic.net/rfc/rfc2119.txt> | <ftp://ds.internic.net/rfc/rfc2119.txt> | |||
| [LANG-TAGS] Alvestrand, H., "Tags for the Identification of | [LANG-TAGS] Alvestrand, H., "Tags for the Identification of | |||
| Languages", RFC 1766. | Languages", RFC 1766. | |||
| <ftp://ds.internic.net/rfc/rfc1766.txt> | <ftp://ds.internic.net/rfc/rfc1766.txt> | |||
| skipping to change at page 6, line 5 ¶ | skipping to change at page 6, line 27 ¶ | |||
| 2047, University of Tennessee, November 1996. | 2047, University of Tennessee, November 1996. | |||
| <ftp://ds.internic.net/rfc/rfc2047.txt> | <ftp://ds.internic.net/rfc/rfc2047.txt> | |||
| [MIME-IMB] Freed, Borenstein, "Multipurpose Internet Mail | [MIME-IMB] Freed, Borenstein, "Multipurpose Internet Mail | |||
| Extensions (MIME) Part One: Format of Internet Message Bodies", RFC | Extensions (MIME) Part One: Format of Internet Message Bodies", RFC | |||
| 2045, Innosoft, First Virtual, November 1996. | 2045, Innosoft, First Virtual, November 1996. | |||
| <ftp://ds.internic.net/rfc/rfc2045.txt> | <ftp://ds.internic.net/rfc/rfc2045.txt> | |||
| [MIME-LANG] Freed, Moore, "MIME Parameter Value and Encoded Words: | ||||
| Character Sets, Language, and Continuations", work in progress, | ||||
| March 1997. | ||||
| [UTF8] Yergeau, F. "UTF-8, a transformation format of Unicode and | [UTF8] Yergeau, F. "UTF-8, a transformation format of Unicode and | |||
| ISO 10646", RFC 2044, Alis Technologies, October 1996. | ISO 10646", RFC 2044, Alis Technologies, October 1996. | |||
| <ftp://ds.internic.net/rfc/rfc2044.txt> | <ftp://ds.internic.net/rfc/rfc2044.txt> | |||
| 9. Acknowledgements | 8. Acknowledgements | |||
| Special thanks to Mark Crispin for the idea of using unused UTF-8 | Special thanks to Mark Crispin for the idea of using unused UTF-8 | |||
| codes for this purpose. Thanks are also due to participants of | codes for this purpose. Thanks are also due to participants of | |||
| the ACAP WG mailing list who helped review this proposal. | the ACAP WG mailing list who helped review this proposal. | |||
| 10. Author's Address | 9. Author's Address | |||
| Chris Newman | Chris Newman | |||
| Innosoft International, Inc. | Innosoft International, Inc. | |||
| 1050 East Garvey Ave. South | 1050 East Garvey Ave. South | |||
| West Covina, CA 91790 USA | West Covina, CA 91790 USA | |||
| Email: chris.newman@innosoft.com | Email: chris.newman@innosoft.com | |||
| Appendix A. Client advice | Appendix A. Client advice | |||
| skipping to change at line 504 ¶ | skipping to change at page 13, line 38 ¶ | |||
| } | } | |||
| /* skip to next MLSF component */ | /* skip to next MLSF component */ | |||
| while (*str != '\0' && *str++ != 0xFEU) | while (*str != '\0' && *str++ != 0xFEU) | |||
| ; | ; | |||
| } while (*str != '\0'); | } while (*str != '\0'); | |||
| } | } | |||
| return (best); | return (best); | |||
| } | } | |||
| Appendix F. Background and Alternate Solutions | ||||
| MLSF was designed to deal with language tagging in the context of | ||||
| the ACAP protocol, but is believed to be useful in other contexts. | ||||
| Specific scenarios cited during discussion were human names in | ||||
| address books, system administrator alert error messages, and error | ||||
| messages which include identifiers potentially in a different | ||||
| language from the client's preferred error message language. Since | ||||
| ACAP is an arbitrary attribute-value protocol, it is impossible to | ||||
| imagine all possible scenarios in advance, so a general purpose | ||||
| mechanism was needed. | ||||
| There have been several attempts to solve language tagging in | ||||
| attribute value protocols. RFC 822 poses a particularly | ||||
| troublesome scenario, since headers must be 7-bit. The MIME | ||||
| solution to label character sets [MIME-HDR] and languages [MIME- | ||||
| LANG] in headers is thus a necessary evil. The result of this is | ||||
| to make header searching services such as those provided by IMAP | ||||
| [IMAP4] massively more complex. If 8-bit headers were permitted a | ||||
| solution like MLSF would have been far simpler and more efficient. | ||||
| Another approach taken is demonstrated by the current vCard, | ||||
| iCalendar, and LDAPv3 proposals (all works in progress). These | ||||
| proposals overload the attribute namespace to provide language | ||||
| tagging and create a concept roughly described as attributes of | ||||
| the attribute. The result of this is that clients have to deal | ||||
| with a multiple attribute response to a query where each attribute | ||||
| may have multiple values. The additional complexity this adds to | ||||
| client processing was deemed unacceptable for ACAP where client | ||||
| simplicity was an important design goal. | ||||
| Another possible approach is the use of a markup language such as | ||||
| text/enriched [ENRICHED]. While this is certainly a suitable | ||||
| language tagging solution for large text objects such as MIME | ||||
| bodies, it is unsuitable for the attribute-value model where | ||||
| searching is a primary function. | ||||
| End of changes. 22 change blocks. | ||||
| 48 lines changed or deleted | 78 lines changed or added | |||
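
The language tag encoding described in section 2 (map to upper-case, add hexadecimal A0 to each octet, split into groups of at most five octets, prefix each group with a UTF-8-style length count) can be sketched in C as follows. This is an illustrative reading of that paragraph and not the Appendix D sample code, which is not shown in this diff. The function name mlsf_encode_tag and the table mlsf_lead are hypothetical. The sketch assumes that the "length count" is the UTF-8 lead octet for a sequence one octet longer than the group, with all payload bits zero (C0, E0, F0, F8, FC for groups of 1 to 5 octets), which is consistent with the lead octet ranges in the formal grammar. It also assumes the input tag contains only US-ASCII letters, digits and hyphens.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Assumed lead octets for groups of 1..5 tag octets: the UTF-8 lead
 * octet for a sequence one octet longer, with the low bits set to 0.
 * Index 0 is unused. */
static const unsigned char mlsf_lead[6] = {
    0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC
};

/* Encode a US-ASCII language tag (e.g. "en-US") as an MLSF language tag.
 * Writes at most dstlen octets to dst, with no terminating NUL.
 * Returns the number of octets written, or 0 if the tag is empty or
 * dst is too small. */
size_t mlsf_encode_tag(const char *tag, unsigned char *dst, size_t dstlen)
{
    size_t remaining = strlen(tag);
    size_t out = 0;

    while (remaining > 0) {
        /* groups of five octets, then a final group of five or fewer */
        size_t group = remaining > 5 ? 5 : remaining;
        size_t i;

        if (out + 1 + group > dstlen)
            return 0;

        dst[out++] = mlsf_lead[group];          /* length prefix */
        for (i = 0; i < group; i++) {
            /* map to upper-case, then add hexadecimal A0 */
            unsigned char c =
                (unsigned char)toupper((unsigned char)*tag++);
            dst[out++] = (unsigned char)(c + 0xA0);
        }
        remaining -= group;
    }
    return out;
}
```

Under these assumptions the tag "en" becomes the three octets E0 E5 EE. Every octet after the length prefix falls in the range C0..FA, so it can never be mistaken for a UTF-8 continuation octet (80..BF) or for the FE octet that introduces an alternative.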