EAI                                                          A. Yang
Internet-Draft                                                 TWNIC
Obsoletes: 5335 (if approved)                              S. Steele
Updates: 2045, 5322 (if approved)                          Microsoft
Intended status: Standards Track                          D. Crocker
Expires: July 29, 2011                   Brandenburg InternetWorking
                                                            N. Freed
                                                              Oracle
                                                    January 25, 2011


Internationalized Email Headers
draft-ietf-eai-rfc5335bis-08

Abstract

Internet mail was originally limited to 7-bit ASCII. Recent enhancements support Unicode's UTF-8 encoding in portions of a message. Full internationalization of electronic mail requires additional enhancement, including support for UTF-8 in user-oriented header fields, such as in the To, From, and Subject fields. This document specifies an enhancement to Internet mail that permits native UTF-8 support in the header and body of a message.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as “work in progress.”

This Internet-Draft will expire on July 29, 2011.

Copyright Notice

Copyright (c) 2011 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.



Table of Contents

1.  Introduction
    1.1.  Changes to the 8-bit clean Model
    1.2.  Terminology
2.  Support for UTF-8 Encoding
    2.1.  Message Object ABNF Changes
    2.2.  Normalization
    2.3.  Content-Transfer-Encoding
3.  Internet Message Format Enhancement
4.  Message Labeling
5.  MIME Enhancement
    5.1.  Content-Transfer-Encoding
    5.2.  MIME Header Field
    5.3.  Content-Type: message/utf8-rfc822
6.  Security Considerations
7.  IANA Considerations
8.  Acknowledgements
9.  References
    9.1.  Normative References
    9.2.  Informative References
Appendix A.  Changes to support UTF-8





1.  Introduction

Internet mail distinguishes a message from its transport and further divides a message between a header and a body [RFC5598] (Crocker, D., “Internet Mail Architecture,” July 2009.). Internet mail header fields contain a variety of strings that are intended to be user-visible. The range of supported characters for these strings was originally limited to a subset of [ASCII] (“Coded Character Set -- 7-bit American Standard Code for Information Interchange,” 1986.); globalization of the Internet requires support of the much larger set contained in UTF-8 [RFC5198] (Klensin, J. and M. Padlipsky, “Unicode Format for Network Interchange,” March 2008.). Complex encoding alternatives to UTF-8, as an overlay to the existing ASCII base, would introduce inefficiencies as well as opportunities for processing errors. Native support for UTF-8 encoding [RFC3629] (Yergeau, F., “UTF-8, a transformation format of ISO 10646,” November 2003.) is widely available among systems now used over the Internet. Hence, supporting this encoding directly within email is desirable. This document specifies an enhancement to Internet mail that permits the use of UTF-8 encoding, rather than only ASCII, as the base form for header fields.

This specification is based on a model of native, end-to-end support for UTF-8, which uses an "8-bit clean" environment. Support for carriage across legacy, 7-bit infrastructure and for processing by 7-bit receivers requires additional mechanisms that are not provided by this specification.




1.1.  Changes to the 8-bit clean Model

This is an extensive revision to the draft. Changes include:

Still Pending:

The goal of the changes is to dramatically simplify the specification and the software needed to support a message with UTF-8 encoding. Rather than specify a wide range of UTF-8-specific changes to the existing ABNF rules, it focuses on the few underlying ABNF rules that are the basis for user-visible ASCII text. The premise for this is simple: If the message is to be in UTF-8, then it is in UTF-8. Subtle or complex rules that selectively add UTF-8 are not worth the effort once the message has already entered the realm of UTF-8.

The question, then, is whether this change has planted some landmines, such as in Trace header fields.




1.2.  Terminology

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119] (Bradner, S., “Key words for use in RFCs to Indicate Requirement Levels,” March 1997.).

Syntax descriptions use Augmented BNF (ABNF) [RFC5234] (Crocker, D. and P. Overell, “Augmented BNF for Syntax Specifications: ABNF,” January 2008.).

Basic terms for this specification include:

ASCII:
An encoding of Control Characters and Basic Latin that occupies 7 bits, per [ASCII] (“Coded Character Set -- 7-bit American Standard Code for Information Interchange,” 1986.). Such a string is fully compatible with email as specified in [RFC5322] (Resnick, P., Ed., “Internet Message Format,” October 2008.).
UTF-8:
An encoding of Unicode in 8-bit bytes, per [RFC3629] (Yergeau, F., “UTF-8, a transformation format of ISO 10646,” November 2003.).



2.  Support for UTF-8 Encoding




2.1.  Message Object ABNF Changes

Internet Mail that conforms to this specification is classed as supporting UTF-8. However, UTF-8 characters within the ASCII range retain the restrictions defined for original, legacy, Latin-only email. Therefore, ABNF enhancements to include UTF-8 incrementally add the non-ASCII portions of UTF-8 to that established base of ASCII.

UTF-8 characters are defined by using the following ABNF taken from [RFC3629] (Yergeau, F., “UTF-8, a transformation format of ISO 10646,” November 2003.):

UTF8-enhancement  =   UTF8-2 / UTF8-3 / UTF8-4

UTF8-2            =   <See Section 4 of RFC3629>

UTF8-3            =   <See Section 4 of RFC3629>

UTF8-4            =   <See Section 4 of RFC3629>
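
As a rough illustration, the UTF8-enhancement rule can be checked against raw octets with a regular expression transcribed from the ABNF in Section 4 of RFC 3629. The sketch below is in Python; the names UTF8_ENHANCEMENT and is_utf8_enhancement are illustrative only and are not defined by this specification.

```python
import re

# Byte-level transcription of the ABNF in Section 4 of RFC 3629.
_TAIL = rb"[\x80-\xBF]"
_UTF8_2 = rb"[\xC2-\xDF]" + _TAIL
_UTF8_3 = (
    rb"\xE0[\xA0-\xBF]" + _TAIL
    + rb"|[\xE1-\xEC]" + _TAIL + _TAIL
    + rb"|\xED[\x80-\x9F]" + _TAIL
    + rb"|[\xEE-\xEF]" + _TAIL + _TAIL
)
_UTF8_4 = (
    rb"\xF0[\x90-\xBF]" + _TAIL + _TAIL
    + rb"|[\xF1-\xF3]" + _TAIL + _TAIL + _TAIL
    + rb"|\xF4[\x80-\x8F]" + _TAIL + _TAIL
)

# UTF8-enhancement = UTF8-2 / UTF8-3 / UTF8-4 (exactly one non-ASCII character).
UTF8_ENHANCEMENT = re.compile(
    rb"(?:" + _UTF8_2 + rb"|" + _UTF8_3 + rb"|" + _UTF8_4 + rb")"
)


def is_utf8_enhancement(octets: bytes) -> bool:
    """Return True if octets encode exactly one non-ASCII UTF-8 character."""
    return UTF8_ENHANCEMENT.fullmatch(octets) is not None
```

ASCII octets, overlong forms such as C0 80, and encoded surrogates such as ED A0 80 do not match, because the RFC 3629 rules exclude them.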




2.2.  Normalization

See [RFC5198] (Klensin, J. and M. Padlipsky, “Unicode Format for Network Interchange,” March 2008.) for a discussion of normalization. A normalized form [NFC] (Davis, M. and K. Whistler, “Unicode Standard Annex #15: Unicode Normalization Forms,” September 2010.) MAY be used. However, [NFC] (Davis, M. and K. Whistler, “Unicode Standard Annex #15: Unicode Normalization Forms,” September 2010.) can lose information that is needed to correctly spell some names in unusual circumstances.
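
The effect of NFC can be sketched with Python's standard unicodedata module. The helper name to_nfc is illustrative. The second case shows one kind of distinction that NFC removes (a singleton mapping); it is offered only as an example of information loss, not as the specific case this section has in mind.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Return text in Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text)


# "e" followed by COMBINING ACUTE ACCENT composes to a single code point.
composed = to_nfc("e\u0301")

# ANGSTROM SIGN (U+212B) is replaced by LATIN CAPITAL LETTER A WITH RING
# ABOVE (U+00C5); the original code point cannot be recovered afterwards.
folded = to_nfc("\u212b")
```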




2.3.  Content-Transfer-Encoding

This specification is based on a requirement for an "8-bit clean" infrastructure. Support for UTF-8 semantics within a 7-bit environment requires translation conventions that are not specified here. Consequently, a Content-Transfer-Encoding value of "7bit" is not useful for a message that is labeled as containing UTF-8.
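
A minimal sketch of why this is so: every non-ASCII character in UTF-8 is encoded using octets with the high bit set, so content that actually uses UTF-8 cannot pass unchanged through a 7-bit path. The function name requires_8bit is hypothetical.

```python
def requires_8bit(data: bytes) -> bool:
    """Return True if any octet has its high bit set."""
    return any(octet > 0x7F for octet in data)


ascii_header = "Subject: hello".encode("utf-8")
utf8_header = "Subject: h\u00e9llo".encode("utf-8")
```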




3.  Internet Message Format Enhancement

This section specifies UTF-8 enhancements for the header of an Internet Mail message, as defined in [RFC5322] (Resnick, P., Ed., “Internet Message Format,” October 2008.).

ABNF used in this section is taken from that specification and the ABNF specification.

This specification retains the [RFC5322] (Resnick, P., Ed., “Internet Message Format,” October 2008.) rules for defining header field names. The bodies of header fields are allowed to contain UTF-8 characters, but the header field names themselves must contain only ASCII characters.

The following rules extend the corresponding rules in [RFC5322] (Resnick, P., Ed., “Internet Message Format,” October 2008.) and [RFC5234] (Crocker, D. and P. Overell, “Augmented BNF for Syntax Specifications: ABNF,” January 2008.) in order to allow additional UTF-8 characters.

VCHAR   =/  UTF8-enhancement

ctext   =/  UTF8-enhancement

atext   =/  UTF8-enhancement

qtext   =/  UTF8-enhancement

TENTATIVE (DCrocker):
text    =/  UTF8-enhancement
               ; note that this upgrades the body to UTF-8

{{ how to add IDN to this? }}
domain  =   dot-atom / domain-literal / obs-domain

This means that all the [RFC5322] (Resnick, P., Ed., “Internet Message Format,” October 2008.) constructs that build upon these will permit UTF-8 characters, including comments and quoted strings.
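
For illustration, the effect of the extended atext rule on dot-atom-text can be sketched as a Python regular expression over text that has already been decoded from UTF-8. The sketch assumes that UTF8-enhancement corresponds to any code point above U+007F after decoding; the names DOT_ATOM_TEXT and is_dot_atom_text are not part of this specification.

```python
import re

# ASCII atext from RFC 5322, written as the inside of a character class.
_ASCII_ATEXT = r"A-Za-z0-9!#$%&'*+\-/=?^_`{|}~"
# Assumption: after decoding, UTF8-enhancement means code points > U+007F.
_NON_ASCII = r"\u0080-\U0010FFFF"

_ATOM = "[" + _ASCII_ATEXT + _NON_ASCII + "]+"
# dot-atom-text = 1*atext *("." 1*atext)
DOT_ATOM_TEXT = re.compile(_ATOM + r"(?:\." + _ATOM + ")*")


def is_dot_atom_text(value: str) -> bool:
    """Return True if value matches dot-atom-text with the extended atext."""
    return DOT_ATOM_TEXT.fullmatch(value) is not None
```

Under these assumptions, a string such as "jos\u00e9.garc\u00eda" is accepted, while strings containing spaces or consecutive dots are still rejected, as in RFC 5322.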

<field-name>
[RFC5322] (Resnick, P., Ed., “Internet Message Format,” October 2008.) has the rule <field-name> which specifies permissible names for user-defined header fields. The current specification defines no changes to that rule.
<msg-id>
This ABNF enables Message-ID strings to be full UTF-8. However, the specification directs that Message-ID strings SHOULD be restricted to ASCII.
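
The split between ASCII-only field names and UTF-8 field bodies can be sketched as follows. This is a simplified illustration that ignores folding and obsolete syntax; the function name split_header_field is hypothetical.

```python
from typing import Tuple


def split_header_field(line: bytes) -> Tuple[str, str]:
    """Split one unfolded header line into (field name, field body)."""
    name_octets, separator, body_octets = line.partition(b":")
    if not separator:
        raise ValueError("missing ':' separator")
    # RFC 5322 ftext: printable US-ASCII except ":" (%d33-57 / %d59-126).
    # The partition above already guarantees that the name contains no ":".
    if not name_octets or any(not 33 <= octet <= 126 for octet in name_octets):
        raise ValueError("field name must be printable ASCII")
    # The body may contain UTF-8; invalid sequences raise UnicodeDecodeError,
    # which is a subclass of ValueError.
    body = body_octets.decode("utf-8")
    return name_octets.decode("ascii"), body.strip()
```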



4.  Message Labeling

For clarity and convenience, a message SHOULD contain an explicit label indicating the character base it uses. This section defines a new header field for this label:

fields         =/ msg-character
msg-character  = "MSG-Char:" ( "ASCII" / "UTF-8" ) CRLF
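
A minimal sketch of recognizing this field, following the ABNF above literally: no whitespace is permitted after the colon, and the literal strings are matched case-insensitively, as ABNF literals are. The names MSG_CHAR and parse_msg_char are illustrative only.

```python
import re
from typing import Optional

# msg-character = "MSG-Char:" ( "ASCII" / "UTF-8" ) CRLF
MSG_CHAR = re.compile(rb"MSG-Char:(ASCII|UTF-8)\r\n", re.IGNORECASE)


def parse_msg_char(line: bytes) -> Optional[str]:
    """Return "ASCII" or "UTF-8" for a matching line, else None."""
    match = MSG_CHAR.fullmatch(line)
    if match is None:
        return None
    return match.group(1).decode("ascii").upper()
```

A receiver that wanted to tolerate optional whitespace after the colon would need a looser pattern; that is outside what the rule as written allows.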




5.  MIME Enhancement




5.1.  Content-Transfer-Encoding

The default Content-Transfer-Encoding is "8bit"; this value is assumed if the Content-Transfer-Encoding header field is not present.
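
The defaulting rule can be sketched with Python's standard email package. The function name effective_cte is hypothetical, and the sketch handles only a top-level message given as text.

```python
from email import message_from_string


def effective_cte(raw_message: str) -> str:
    """Return the declared Content-Transfer-Encoding, or "8bit" if absent."""
    message = message_from_string(raw_message)
    return message.get("Content-Transfer-Encoding", "8bit").strip().lower()


unlabeled = "Subject: hi\n\nbody\n"
labeled = "Subject: hi\nContent-Transfer-Encoding: BASE64\n\nAAAA\n"
```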




5.2.  MIME Header Field

MIME contains at least one header field that is intended for user display, namely Content-Description. This section specifies UTF-8 enhancements to MIME header fields, as defined in [RFC2045] (Freed, N. and N. Borenstein, “Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies,” November 1996.). ABNF rules used in this section are taken from that specification and the ABNF specification.

The enhanced ABNF rules are:

text    =/ UTF8-enhancement




5.3.  Content-Type: message/utf8-rfc822

The type message/utf8-rfc822 is similar to message/rfc822. However, it specifies that characters are interpreted as UTF-8 rather than being limited to ASCII.

Type name:
message
Subtype name:
utf8-rfc822
Required parameters:
none
Optional parameters:
none
Encoding considerations:
Any content-transfer-encoding is permitted. The 8-bit or binary content-transfer-encodings are recommended where permitted.
Security considerations:
See Section 6 (Security Considerations).
Interoperability considerations:
The media type provides functionality similar to the message/rfc822 content type for email messages with international email headers. When there is a need to embed or return such content in another message, there is generally an option to use this media type and leave the content unchanged or down-convert the content to message/rfc822. Both of these choices will interoperate with the installed base, but with different properties. Systems unaware of internationalized headers will typically treat a message/utf8-rfc822 body part as an unknown attachment, while they will understand the structure of a message/rfc822. However, systems that understand message/utf8-rfc822 will provide functionality superior to the result of a down-conversion to message/rfc822. The most interoperable choice depends on the deployed software.
Published specification:
RFC XXXX
Applications that use this media type:
SMTP servers and email clients that support multipart/report generation or parsing. Email clients which forward messages with international headers as attachments.
Additional information:
Magic number(s):
none
File extension(s):
The extension ".u8msg" is suggested.
Macintosh file type code(s):
A uniform type identifier (UTI) of "public.utf8-email-message" is suggested. This conforms to "public.message" and "public.composite-content", but does not necessarily conform to "public.utf8-plain-text".
Person & email address to contact for further information:
See the Authors' Addresses section of this document.
Intended usage:
COMMON
Restrictions on usage:
This is a structured media type which embeds other MIME media types. The 8-bit or binary content-transfer-encoding SHOULD be used unless this media type is sent over a 7-bit-only transport.
Author:
See the Authors' Addresses section of this document.
Change controller:
IETF Standards Process




6.  Security Considerations

If a user has a mailbox address in UTF-8 and a mailbox address in ASCII, a digital certificate that identifies that user might have both addresses in the identity. Having multiple email addresses as identities in a single certificate is already supported in PKIX (Public Key Infrastructure for X.509 Certificates) [RFC5280] (Cooper, D., Santesson, S., Farrell, S., Boeyen, S., Housley, R., and W. Polk, “Internet X.509 Public Key Infrastructure Certificate and Certificate Revocation List (CRL) Profile,” May 2008.) and OpenPGP [RFC3156] (Elkins, M., Del Torto, D., Levien, R., and T. Roessler, “MIME Security with OpenPGP,” August 2001.).

Because UTF-8 often requires several octets to encode a single character, internationalized local parts and header values may cause mail addresses to become longer. As specified in [RFC5322] (Resnick, P., Ed., “Internet Message Format,” October 2008.), each line of characters MUST be no more than 998 octets, excluding the CRLF. In addition, MDA (Mail Delivery Agent) processes that parse, store, or handle email addresses or local parts must take extra care not to overflow buffers, truncate addresses, or exceed storage allotments. When comparing addresses, they must also take care to use the entire lengths of the addresses.
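
Because the limit is expressed in octets rather than characters, a length check has to be made on the encoded form. The following Python sketch illustrates the point; the names MAX_LINE_OCTETS and overlong_lines are illustrative.

```python
from typing import List

MAX_LINE_OCTETS = 998  # RFC 5322 limit per line, excluding the CRLF


def overlong_lines(message: bytes) -> List[int]:
    """Return 1-based numbers of lines longer than 998 octets."""
    return [
        number
        for number, line in enumerate(message.split(b"\r\n"), start=1)
        if len(line) > MAX_LINE_OCTETS
    ]


# 500 characters, but 1000 octets once encoded as UTF-8.
sample = ("Subject: ok\r\n" + "\u00e9" * 500 + "\r\n").encode("utf-8")
```

Here the second line is only 500 characters long but exceeds the limit, which a character-based check would miss.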

The security impact of UTF-8 headers on email signature systems such as Domain Keys Identified Mail (DKIM), S/MIME, and OpenPGP is discussed in [I-D.eai-frmwrk-4952bis] (Klensin, J. and Y. Ko, “Overview and Framework for Internationalized Email,” September 2010.), Section 14.




7.  IANA Considerations

IANA is requested to update the registration of the message/utf8-rfc822 MIME type using the registration form contained in Section 5.3 (Content-Type: message/utf8-rfc822).




8.  Acknowledgements

This document incorporates many ideas first described in Internet-Draft form by Paul Hoffman, although many details have changed from that earlier work.

The author especially thanks Jeff Yeh for his efforts and contributions on editing previous versions.

Most of the content of this document is provided by John C Klensin. Also, some significant comments and suggestions were received from Charles H. Lindsey, Kari Hurtta, Pete Resnick, Alexey Melnikov, Chris Newman, Yangwoo Ko, Yoshiro Yoneya, and other members of the JET team (Joint Engineering Team) and were incorporated into the document. The editor sincerely thanks them for their contributions.




9.  References




9.1. Normative References

[ASCII] “Coded Character Set -- 7-bit American Standard Code for Information Interchange,” ANSI X3.4, 1986.
[I-D.eai-frmwrk-4952bis] Klensin, J. and Y. Ko, “Overview and Framework for Internationalized Email,” draft-ietf-eai-frmwrk-4952bis-10 (work in progress), September 2010.
[Latin] Unicode Consortium, “C0 Controls and Basic Latin,” http://unicode.org/charts/PDF/U0000.pdf, 2010.
[NFC] Davis, M. and K. Whistler, “Unicode Standard Annex #15: Unicode Normalization Forms,” September 2010.
[RFC2119] Bradner, S., “Key words for use in RFCs to Indicate Requirement Levels,” BCP 14, RFC 2119, March 1997.
[RFC3629] Yergeau, F., “UTF-8, a transformation format of ISO 10646,” STD 63, RFC 3629, November 2003.
[RFC5198] Klensin, J. and M. Padlipsky, “Unicode Format for Network Interchange,” RFC 5198, March 2008.
[RFC5234] Crocker, D. and P. Overell, “Augmented BNF for Syntax Specifications: ABNF,” STD 68, RFC 5234, January 2008.
[RFC5322] Resnick, P., Ed., “Internet Message Format,” RFC 5322, October 2008.
[RFC5598] Crocker, D., “Internet Mail Architecture,” RFC 5598, July 2009.
[Unicode] Unicode Consortium, “Unicode 6.0 Character Code Charts,” http://unicode.org/charts/, 2010.



9.2. Informative References

[RFC2045] Freed, N. and N. Borenstein, “Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies,” RFC 2045, November 1996.
[RFC2046] Freed, N. and N. Borenstein, “Multipurpose Internet Mail Extensions (MIME) Part Two: Media Types,” RFC 2046, November 1996.
[RFC3156] Elkins, M., Del Torto, D., Levien, R., and T. Roessler, “MIME Security with OpenPGP,” RFC 3156, August 2001.
[RFC5280] Cooper, D., Santesson, S., Farrell, S., Boeyen, S., Housley, R., and W. Polk, “Internet X.509 Public Key Infrastructure Certificate and Certificate Revocation List (CRL) Profile,” RFC 5280, May 2008.



Appendix A.  Changes to support UTF-8

This section provides a basic audit of the places in a message that can now permit UTF-8 rather than being restricted to ASCII, based on the changes to the underlying ABNF. The audit ignores the rules for "obsolete" constructs in RFC 5322. (This is a first cut, and the list is likely incomplete.)

VCHAR:
quoted-pair, unstructured
> ccontent, qcontent
> comment, quoted-string
> word, local-part
> phrase
> display-name, keywords
ctext:
ccontent > comment
atext:
atom, dot-atom-text
qtext:
qcontent > quoted-string




Authors' Addresses

  Abel Yang
  TWNIC
  4F-2, No. 9, Sec 2, Roosevelt Rd.
  Taipei, 100
  Taiwan
Phone:  +886 2 23411313 ext 505
EMail:  abelyang@twnic.net.tw
  
  Shawn Steele
  Microsoft
EMail:  Shawn.Steele@microsoft.com
  
  D. Crocker
  Brandenburg InternetWorking
  675 Spruce Dr.
  Sunnyvale
  USA
Phone:  +1.408.246.8253
EMail:  dcrocker@bbiw.net
URI:  http://bbiw.net
  
  Ned Freed
  Oracle
  800 Royal Oaks
  Monrovia, CA 91016-6347
  USA
EMail:  ned.freed@mrochek.com