APPSAWG                                                     M. Kucherawy
Internet-Draft                                                G. Shapiro
Intended status: Informational                                  N. Freed
Expires: May 26, 2014                                  November 22, 2013

             Advice for Safe Handling of Malformed Messages
                  draft-ietf-appsawg-malformed-mail-11

Abstract

   Although Internet mail formats have been precisely defined since the
   1970s, authoring and handling software often show only mild
   conformance to the specifications.
   The malformed messages that result are non-standard. Nonetheless,
   decades of experience have shown that handling the resulting
   malformations with some tolerance is often an acceptable approach,
   and is better than rejecting the messages outright as nonconformant.
   This document includes a collection of the best advice available
   regarding a variety of common malformed mail situations, to be used
   as implementation guidance.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on May 26, 2014.

Copyright Notice

   Copyright (c) 2013 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1. Introduction
      1.1. The Purpose Of This Work
      1.2. Not The Purpose Of This Work
      1.3. General Considerations
   2. Document Conventions
      2.1. Examples
   3. Background
   4. Invariant Content
   5. Mail Submission Agents
   6. Line Termination
   7. Header Anomalies
      7.1. Converting Obsolete and Invalid Syntaxes
         7.1.1. Host-Address Syntax
         7.1.2. Excessive Angle Brackets
         7.1.3. Unbalanced Angle Brackets
         7.1.4. Unbalanced Parentheses
         7.1.5. Commas in Address Lists
         7.1.6. Unbalanced Quotes
         7.1.7. Naked Local-Parts
      7.2. Non-Header Lines
      7.3. Unusual Spacing
      7.4. Header Malformations
      7.5. Header Field Counts
         7.5.1. Repeated Header Fields
         7.5.2. Missing Header Fields
         7.5.3. Return-Path
      7.6. Missing or Incorrect Charset Information
      7.7. Eight-Bit Data
   8. MIME Anomalies
      8.1. Missing MIME-Version Field
      8.2. Faulty Encodings
   9. Body Anomalies
      9.1. Oversized Lines
   10. Security Considerations
   11. IANA Considerations
   12. References
      12.1. Normative References
      12.2. Informative References
   Appendix A. RFC Editor Notes
   Appendix B. Acknowledgements

1. Introduction

1.1. The Purpose Of This Work

   The history of email standards, going back to [RFC733] and beyond,
   contains a fairly rigid evolution of specifications. However,
   implementations within that culture have also long had an
   undercurrent known formally as the robustness principle, also known
   informally as Postel's Law: "Be liberal in what you accept, and
   conservative in what you send." [RFC1122]

   Jon Postel's directive is often misinterpreted to mean that any
   deviance from a specification is acceptable. Rather, it was intended
   only to account for legitimate variations in interpretation within
   specifications, as well as basic transit errors, like bit errors.
   Taken to its unintended extreme, excessive tolerance would imply that
   there are no limits to the liberties that a sender might take, while
   presuming a burden on a receiver to guess "correctly" at the meaning
   of any such variation. These matters are further compounded by
   receiver software -- the end users' mail readers -- which is also
   sometimes flawed, leaving senders to craft messages (sometimes
   bending the rules) to overcome those flaws.

   In general, this served the email ecosystem well by allowing a few
   errors in implementations without obstructing participation in the
   game.
   The proverbial bar was set low. However, as we have evolved
   into the current era, some of these lenient stances have begun to
   expose opportunities that can be exploited by malefactors. Various
   email-based applications rely on strong application of these
   standards for simple security checks, while the very basic building
   blocks of that infrastructure, intending to be robust, fail utterly
   to assert those standards.

   The distributed and non-interactive nature of email has often
   prompted adjustments to receiving software, to handle these
   variations, rather than trying to gain better conformance by senders,
   since the receiving operator is primarily driven by complaints from
   recipient users and has no authority over the sending side of the
   system. Processing with such flexibility comes at some cost, since
   mail software is faced with decisions about whether to permit
   non-conforming messages to continue toward their destinations
   unaltered, adjust them to conform (possibly at the cost of losing
   some of the original message), or reject them outright.

   This document includes a collection of the best advice available
   regarding a variety of common malformed mail situations, to be used
   as implementation guidance. These malformations are typically based
   around loose interpretations or implementations of specifications
   such as Internet Message Format [MAIL] and Multipurpose Internet Mail
   Extensions [MIME].

1.2. Not The Purpose Of This Work

   It is important to understand that this work is not an effort to
   endorse or standardize certain common malformations. The code and
   culture that introduce such messages into the mail stream need to
   be repaired, as the security penalty now being paid for this lax
   processing arguably outweighs the reduction in support costs to end
   users who are not expected to understand the standards.
   However, the reality is that this will not be fixed quickly.

   Given this, it is beneficial to provide implementers with guidance
   about the safest or most effective way to handle malformed messages
   when they arrive, taking into consideration the tradeoffs of the
   choices available, especially with respect to how various actors in
   the email ecosystem respond to such messages in terms of handling,
   parsing, or rendering to end users.

1.3. General Considerations

   Many deviations from message format standards are considered by some
   receivers to be strong indications that the message is undesirable,
   such as spam or something containing malware. These receivers
   quickly decide that the best handling choice is simply to reject or
   discard the message. This means malformations caused by innocent
   misunderstandings or ignorance of proper syntax can cause messages
   with no ill intent also to fail to be delivered.

   Senders that want to ensure message delivery are best advised to
   adhere strictly to the relevant standards (including, but not limited
   to, [MAIL], [MIME], and [DKIM]), as well as observe other industry
   best practices such as may be published from time to time either by
   the IETF or independently.

   Receivers that haven't the luxury of strict enforcement of the
   standards on inbound messages are usually best served by observing
   the following guidelines for handling of malformed messages:

   1.  Whenever possible, mitigation of syntactic malformations should
       be guided by an assessment of the most likely semantic intent.
       For example, it is reasonable to conclude that multiple sets of
       angle brackets around an address are simply superfluous and can
       be dropped.

   2.
       When the intent is unclear, or when it is clear but also
       impractical to change the content to reflect that intent,
       mitigation should be limited to cases where not taking any
       corrective action would clearly lead to a worse outcome.

   3.  Security issues, when present, need to be addressed and may force
       mitigation strategies that are otherwise suboptimal.

2. Document Conventions

2.1. Examples

   Examples of message content include a number within braces at the end
   of each line. These are line numbers for use in subsequent
   discussion, and are not actually part of the message content
   presented in the example.

   Blank lines are not numbered in the examples.

3. Background

   The reader would benefit from reading [EMAIL-ARCH] for some general
   background about the overall email architecture. Of particular
   interest is the Internet Message Format, detailed in [MAIL].
   Throughout this document, the use of the term "message" should be
   assumed to mean a block of text conforming to the Internet Message
   Format.

4. Invariant Content

   An agent handling a message could use several distinct
   representations of the message. One is an internal representation,
   such as separate blocks of storage for the header and body, some
   header or body alterations, or tables indexed by header name, set up
   to make particular kinds of processing easier. The other is the
   representation passed along to the next agent in the handling chain.
   This might be identical to the message input to the module, or it
   might have some changes such as added or reordered header fields or
   body elisions to remove malicious content.

   Message handling is usually most effective when each module in a
   sequence of handling modules receives the same content for analysis.
   A module that "fixes" or otherwise alters the content passed to later
   modules can prevent the later modules from identifying malicious or
   other content that exposes the end user to harm. It is important that
   all processing modules can make consistent assertions about the
   content. Modules that operate sequentially sometimes add private
   header fields to relay information downstream for later filters to
   use (and possibly remove), or they may have out-of-band ways of doing
   so. However, even the presence of private header fields can impact a
   downstream handling agent unaware of their local semantics, so an
   out-of-band method is always preferable.

   The above is less of a concern when multiple analysis modules are
   operated in parallel, independent of one another.

   Often, abuse reporting systems can act effectively only when a
   complaint or report contains the original message exactly as it was
   generated. Messages that have been altered by handling modules might
   render a complaint inactionable as the system receiving the report
   may be unable to identify the original message as one of its own.

   Some message changes alter syntax without changing semantics. For
   example, Section 7.4 describes a situation where an agent removes
   additional header whitespace. This is a syntax change without a
   change in semantics, though some systems (such as DKIM) are sensitive
   to such changes. Message system developers need to be aware of the
   downstream impact of making either kind of change.

   Where a change to content between modules is unavoidable, adding
   trace data (such as prepending a standard Received field) will at
   least allow tracing of the handling by modules that actually see
   different input.

   There will always be local handling exceptions, but these guidelines
   should be useful for developing integrated message processing
   environments.
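
   One way to give every module in a sequence the same input, while
   still letting modules pass findings downstream out-of-band, is
   sketched below in Python. This is an illustration only, not a
   description of any particular product: the names Original,
   Annotations, run_pipeline, null_byte_check, and bare_lf_check are
   all hypothetical, and the two checks are placeholders chosen for
   brevity.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# LF that is not preceded by CR (a "bare" line feed).
BARE_LF = re.compile(rb"(?<!\r)\n")


@dataclass(frozen=True)
class Original:
    """Hypothetical wrapper: the message bytes exactly as received."""
    raw: bytes


@dataclass
class Annotations:
    """Out-of-band notes shared between modules.

    Nothing here is ever written into the message itself.
    """
    notes: Dict[str, List[str]] = field(default_factory=dict)

    def add(self, module: str, note: str) -> None:
        self.notes.setdefault(module, []).append(note)


Module = Callable[[Original, Annotations], None]


def null_byte_check(msg: Original, ann: Annotations) -> None:
    if b"\x00" in msg.raw:
        ann.add("null_byte_check", "message contains null bytes")


def bare_lf_check(msg: Original, ann: Annotations) -> None:
    if BARE_LF.search(msg.raw):
        ann.add("bare_lf_check", "message contains LF without CR")


def run_pipeline(raw: bytes, modules: List[Module]) -> Annotations:
    """Hand the same unmodified bytes to every module, in order."""
    original = Original(raw)
    annotations = Annotations()
    for module in modules:
        module(original, annotations)
    return annotations
```

   Because the bytes object is immutable and the wrapper is frozen, a
   module in this sketch cannot alter what later modules analyze; it
   can only add to the shared annotations.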

   In most cases, this document only discusses techniques used on
   internal representations. It is occasionally necessary to make
   changes between the input and output versions; such cases will be
   called out explicitly.

5. Mail Submission Agents

   Within the email context, the single most influential component that
   can reduce the presence of malformed items in the email system is the
   Mail Handling Service (MHS; see [EMAIL-ARCH]), which includes the
   Mail Submission Agent (MSA). This is the component that is
   essentially the interface between end users that create content and
   the mail stream.

   MHSes need to become more strict about enforcement of all relevant
   email standards, especially [MAIL] and the [MIME] family of
   documents.

   More strict conformance by relaying Mail Transfer Agents (MTAs) will
   also be helpful. Although preventing the dissemination of malformed
   messages is desirable, the rejection of such mail already in transit
   also has a support cost, namely the creation of a [DSN] that many end
   users might not understand.

6. Line Termination

   For interoperable Internet Mail messages, the only valid line
   separation sequence during a typical SMTP session is ASCII 0x0D
   ("carriage return", or CR) followed by ASCII 0x0A ("line feed", or
   LF), commonly referred to as CRLF. This is not the case for binary
   mode SMTP (see [BINARYSMTP]).

   Common UNIX user tools, however, typically only use LF for internal
   line termination. This means that a protocol engine that converts
   between UNIX and Internet Mail formats has to convert between these
   two end-of-line representations before transmitting a message or
   after receiving it.

   Non-compliant implementations can create messages with a mix of line
   terminations, such as LF everywhere except CRLF only at the end of
   the message.
   According to [SMTP] and [MAIL], this means the entire
   message actually exists on a single line.

   Within modern Internet Mail it is highly unlikely that an isolated CR
   or LF is valid in common ASCII text. Furthermore, when content
   actually does need to contain such an unusual character sequence,
   [MIME] provides mechanisms for encoding that content in an SMTP-safe
   manner.

   Thus, it will typically be safe and helpful to treat an isolated CR
   or LF as equivalent to a CRLF when parsing a message.

   Note that this advice pertains only to the raw SMTP data, and not to
   decoded MIME entities. As noted above, when MIME encoding mechanisms
   are used, the unusual character sequences are not visible in the raw
   SMTP stream.

7. Header Anomalies

   This section covers common syntactic and semantic anomalies found in
   a message header, and presents suggested mitigations.

7.1. Converting Obsolete and Invalid Syntaxes

   A message using an obsolete header syntax (see Section 4 of [MAIL])
   might confound an agent that is attempting to be robust in its
   handling of syntax variations. A bad actor could exploit such a
   weakness in order to get abusive or malicious content through a
   filter. This section presents some examples of such variations.
   Messages including them ought to be rejected; where this is not
   possible, recommended internal interpretations are provided.

7.1.1. Host-Address Syntax

   The following obsolete syntax attempts to specify source routing:

       To: <@example.net:fran@example.com>

   This means "send to fran@example.com via the mail service at
   example.net". It can safely be interpreted as:

       To: <fran@example.com>

7.1.2. Excessive Angle Brackets

   The following over-use of angle brackets:

       To: <<>>

   can safely be interpreted as:

       To:

7.1.3.
Unbalanced Angle Brackets

   The following use of unbalanced angle brackets:

       To:

   The following:

       To: second@example.org>

   can usually be treated as:

       To: second@example.org

7.1.4. Unbalanced Parentheses

   The following use of unbalanced parentheses:

       To: (Testing

   can safely be interpreted as:

       To: (Testing)

   Likewise, this case:

       To: Testing)

   can safely be interpreted as:

       To: "Testing)"

   In both cases, it is obvious where the active email address in the
   string can be found. The former case retains the active email
   address in the string by completing what appears to be intended as a
   comment; the intent in the latter case is less obvious, so the
   leading string is interpreted as a display name.

7.1.5. Commas in Address Lists

   This use of an errant comma:

       To: <third@example.net, fourth@example.net>

   can usually be interpreted as ending an address, so the above is
   usually best interpreted as:

       To: third@example.net, fourth@example.net

7.1.6. Unbalanced Quotes

   The following use of unbalanced quotation marks:

       To: "Joe

   leaves software with no obvious "good" interpretation. If it is
   essential to extract an address from the above, one possible
   interpretation is:

       To: "Joe "@example.net

   where "example.net" is the domain name or host name of the handling
   agent making the interpretation. Another possible interpretation,
   much simpler and likely more correct, is simply:

       To: "Joe"

7.1.7. Naked Local-Parts

   [MAIL] defines a local-part as the user portion of an email address,
   and the display-name as the "user-friendly" label that accompanies
   the address specification.

   Some broken submission agents might introduce messages with only a
   local-part or only a display-name and no properly formed address.
   For example:

       To: Joe

   A submission agent ought to reject this or, at a minimum, append "@"
   followed by its own host name or some other valid name likely to
   enable a reply to be delivered to the correct mailbox. Where this is
   not done, an agent receiving such a message will probably be
   successful by synthesizing a valid header field for evaluation using
   the techniques described in Section 7.5.2.

7.2. Non-Header Lines

   Some messages contain a line of text in the header that is not a
   valid message header field of any kind. For example:

       From: user@example.com                            {1}
       To: userpal@example.net                           {2}
       Subject: This is your reminder                    {3}
       about the football game tonight                   {4}
       Date: Wed, 20 Oct 2010 20:53:35 -0400             {5}

       Don't forget to meet us for the tailgate party!   {7}

   The cause of this is typically a bug in a message generator of some
   kind. Line {4} was intended to be a continuation of line {3}; it
   should have been indented by whitespace as set out in Section 2.2.3
   of [MAIL].

   This anomaly has varying impacts on processing software, depending on
   the implementation:

   1.  some agents choose to separate the header of the message from the
       body only at the first empty line (that is, a CRLF immediately
       followed by another CRLF);

   2.  some agents assume this anomaly should be interpreted to mean the
       body starts at line {4}, as the end of the header is assumed by
       encountering something that is not a valid header field or folded
       portion thereof;

   3.  some agents assume this should be interpreted as an intended
       header folding as described above and thus simply append a single
       space character (ASCII 0x20) and the content of line {4} to that
       of line {3};

   4.  some agents reject this outright as line {4} is neither a valid
       header field nor a folded continuation of a header field prior to
       an empty line.
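
   The difference between the first three behaviors can be made
   concrete with a short Python sketch. It is illustrative only: the
   function names, the simplified field-name pattern, and the
   representation of a message as a list of already-split lines are
   assumptions made for this example, not a description of any
   deployed agent.

```python
import re
from typing import List, Tuple

# Simplified test for "looks like the start of a header field": printable
# ASCII other than colon, optional whitespace, then a colon.
FIELD_START = re.compile(r"^[!-9;-~]+[ \t]*:")

Split = Tuple[List[str], List[str]]


def is_header_line(line: str) -> bool:
    """True for a field start or a properly folded continuation line."""
    return bool(FIELD_START.match(line)) or line[:1] in (" ", "\t")


def split_option1(lines: List[str]) -> Split:
    """Option 1: the header ends only at the first empty line."""
    if "" in lines:
        i = lines.index("")
        return lines[:i], lines[i + 1:]
    return lines, []


def split_option2(lines: List[str]) -> Split:
    """Option 2: the body starts at the first line that is not a header."""
    for i, line in enumerate(lines):
        if line == "":
            return lines[:i], lines[i + 1:]
        if not is_header_line(line):
            return lines[:i], lines[i:]
    return lines, []


def split_option3(lines: List[str]) -> Split:
    """Option 3: join a non-field line to the previous field with a space.

    Properly folded lines are joined the same way in this sketch.
    """
    header, body = split_option1(lines)
    merged: List[str] = []
    for line in header:
        if merged and not FIELD_START.match(line):
            merged[-1] += " " + line.strip()
        else:
            merged.append(line)
    return merged, body


MESSAGE = [
    "From: user@example.com",
    "To: userpal@example.net",
    "Subject: This is your reminder",
    "about the football game tonight",
    "Date: Wed, 20 Oct 2010 20:53:35 -0400",
    "",
    "Don't forget to meet us for the tailgate party!",
]
```

   Applied to MESSAGE, split_option2 places the Date field in the body,
   while split_option1 and split_option3 keep it in the header. Two
   agents that disagree in this way will not examine the same header.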

   This can be exploited if it is known that one message handling agent
   will take one action while the next agent in the handling chain will
   take another. Consider, for example, a message filter that searches
   message headers for properties indicative of abusive or malicious
   content that is attached to a Mail Transfer Agent (MTA) implementing
   option 2 above. An attacker could craft a message that includes this
   malformation at a position above the property of interest, knowing
   the MTA will not consider that content part of the header, and thus
   the MTA will not feed it to the filter, thus avoiding detection.
   Meanwhile, the Mail User Agent (MUA) that presents the content to an
   end user implements option 1 or 3, which has some undesirable
   effect.

   It should be noted that a few implementations choose option 4 above
   since any reputable message generation program will get header
   folding right, and thus anything so blatant as this malformation is
   likely an error caused by a malefactor.

   The preferred implementation if option 4 above is not employed is to
   apply the following heuristic when this malformation is detected:

   1.  Search forward for an empty line. If one is found, then apply
       option 3 above to the anomalous line, and continue.

   2.  Search forward for another line that appears to be a new header
       field (a name followed by a colon). If one is found, then apply
       option 3 above to the anomalous line, and continue.

7.3. Unusual Spacing

   The following message is valid per [MAIL]:

       From: user@example.com                            {1}
       To: userpal@example.net                           {2}
       Subject: This is your reminder                    {3}
                                                         {4}
        about the football game tonight                  {5}
       Date: Wed, 20 Oct 2010 20:53:35 -0400             {6}

       Don't forget to meet us for the tailgate party!   {8}

   Line {4} contains a single whitespace character. The intended result
   is that lines {3}, {4}, and {5} comprise a single continued header
   field.
   However, some agents are aggressive at stripping trailing whitespace,
   which will cause line {4} to be treated as an empty line, and thus
   the separator line between header and body. This can affect
   header-specific processing algorithms as described in the previous
   section.

   This example was legal in earlier versions of the Internet Mail
   format standard, but was rendered obsolete as of [RFC2822] as line
   {4} could be interpreted as the separator between the header and
   body.

   The best handling of this example is for a message parsing engine to
   behave as if line {4} were not present in the message and for a
   message creation engine to emit the message with line {4} removed.

7.4. Header Malformations

   Among the many possible malformations, a common one is insertion of
   whitespace at unusual locations, such as:

       From: user@example.com                            {1}
       To: userpal@example.net                           {2}
       Subject: This is your reminder                    {3}
       MIME-Version : 1.0                                {4}
       Content-Type: text/plain                          {5}
       Date: Wed, 20 Oct 2010 20:53:35 -0400             {6}

       Don't forget to meet us for the tailgate party!   {8}

   Note the addition of whitespace in line {4} after the header field
   name but before the colon that separates the name from the value.

   The obsolete grammar of Section 4 of [MAIL] permits that extra
   whitespace, so it cannot be considered invalid. However, a consensus
   of implementations prefers to remove that whitespace. There is no
   perceived change to the semantics of the header field being altered
   as the whitespace is itself semantically meaningless. Therefore, it
   is best to remove all whitespace after the field name but before the
   colon and to emit the field in this modified form.

7.5. Header Field Counts

   Section 3.6 of [MAIL] prescribes specific header field counts for a
   valid message.
   Few agents actually enforce these, in the sense that messages whose
   header contents exceed one or more limits set there are generally
   allowed to pass; they typically add any required fields that are
   missing, however.

   Also, few agents that use messages as input, including Mail User
   Agents (MUAs) that actually display messages to users, verify that
   the input is valid before proceeding. Some popular open source
   filtering programs and some popular Mailing List Management (MLM)
   packages select either the first or last instance of a particular
   field name, such as From, to decide who sent a message. Absent
   strict enforcement of [MAIL], an attacker can craft a message with
   multiple instances of the same field if that attacker knows the
   filter will make a decision based on one but the user will be
   shown the others.

   This situation is exacerbated when message validity is assessed, such
   as through enhanced authentication methods like DomainKeys Identified
   Mail [DKIM]. Such methods might cover one instance of a constrained
   field but not another, taking the wrong one as "good" or "safe". An
   MUA, for example, could show the first of two From fields to an end
   user as "good" or "safe" while an authentication method actually only
   verified the second.

   In attempting to counter this exposure, one of the following
   strategies can be used:

   1.  reject outright or refuse to process further any input message
       that does not conform to Section 3.6 of [MAIL];

   2.  remove or, in the case of an MUA, refuse to render any instances
       of a header field whose presence exceeds a limit prescribed in
       Section 3.6 of [MAIL] when generating its output;

   3.  where a field has a limited instance count, combine additional
       instances into a single instance carrying the same information as
       the multiple instances;

   4.
       where a field can contain multiple distinct values (such as From)
       or is free-form text (such as Subject), combine them into a
       semantically identical single header field of the same name (see
       Section 7.5.1);

   5.  alter the name of any header field whose presence exceeds a limit
       prescribed in Section 3.6 of [MAIL] when generating its output so
       that later agents can produce a consistent result. Any
       alteration likely to cause the field to be ignored by downstream
       agents is acceptable. A common approach is to prefix the field
       names with a string such as "BAD-".

   Selecting a mitigation action from the above list, or some other
   action, must consider the needs of the operator making the decision,
   and the nature of its user base.

7.5.1. Repeated Header Fields

   There are some occasions where repeated fields are encountered where
   only one is expected. Two examples are presented. First:

       From: reminders@example.com                       {1}
       To: jqpublic@example.com                          {2}
       Subject: Automatic Meeting Reminder               {3}
       Subject: 4pm Today -- Staff Meeting               {4}
       Date: Wed, 20 Oct 2010 08:00:00 -0700             {5}

       Reminder of the staff meeting today in the small  {6}
       auditorium. Come early!                           {7}

   The message above has two Subject fields, which is in violation of
   Section 3.6 of [MAIL]. A safe interpretation of this would be to
   treat it as though the two Subject field values were concatenated, so
   long as they are not identical, such as:

       From: reminders@example.com                       {1}
       To: jqpublic@example.com                          {2}
       Subject: Automatic Meeting Reminder               {3}
        4pm Today -- Staff Meeting                       {4}
       Date: Wed, 20 Oct 2010 08:00:00 -0700             {5}

       Reminder of the staff meeting today in the small  {6}
       auditorium. Come early!
                                                         {7}

   Second:

       From: president@example.com                       {1}
       From: vice-president@example.com                  {2}
       To: jqpublic@example.com                          {3}
       Subject: A note from the E-Team                   {4}
       Date: Wed, 20 Oct 2010 08:00:00 -0700             {5}

       This memo is to remind you of the corporate dress {6}
       code. Attached you will find an updated copy of   {7}
       the policy.                                       {8}
       ...

   As with the first example, there is a violation in terms of the
   number of instances of the From field. A likely safe interpretation
   would be to combine these into a comma-separated address list in a
   single From field:

       From: president@example.com,                      {1}
        vice-president@example.com                       {2}
       To: jqpublic@example.com                          {3}
       Subject: A note from the E-Team                   {4}
       Date: Wed, 20 Oct 2010 08:00:00 -0700             {5}

       This memo is to remind you of the corporate dress {6}
       code. Attached you will find an updated copy of   {7}
       the policy.                                       {8}
       ...

7.5.2. Missing Header Fields

   Similar to the previous section, there are messages seen in the wild
   that lack certain required header fields. In particular, [MAIL]
   requires that a From and Date field be present in all messages.

   When presented with a message lacking these fields, the MTA might
   perform one of the following:

   1.  Make no changes

   2.  Add an instance of the missing field(s) using synthesized content
       based on data provided in other parts of the protocol

   Option 2 is recommended for handling this case. Handling agents
   should add these for internal handling if they are missing, but
   should not add them to the external representation. The reason for
   this advice is that there are some filter modules that would consider
   the absence of such fields to be a condition warranting special
   treatment (for example, rejection), and thus the effectiveness of
   such modules would be stymied by an upstream filter adding them in a
   way visible to other components.
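
   A minimal Python sketch of option 2 follows. It builds a separate
   internal view of the header and leaves the original list, which
   stands in for the external representation, untouched. The function
   names, the list-of-pairs header model, and the choice to take the
   text after the last semicolon of the first Received field as the
   date are assumptions made for this illustration.

```python
from typing import List, Optional, Tuple

Header = List[Tuple[str, str]]


def has_field(header: Header, name: str) -> bool:
    """Case-insensitive test for the presence of a header field."""
    return any(n.lower() == name.lower() for n, _ in header)


def date_from_received(header: Header) -> Optional[str]:
    """Return the text after the last ';' of the first Received field."""
    for name, value in header:
        if name.lower() == "received" and ";" in value:
            return value.rsplit(";", 1)[1].strip()
    return None


def internal_view(
    header: Header,
    envelope_sender: Optional[str],
    placeholder_from: str = "unknown@[192.0.2.1]",
) -> Header:
    """Copy the header, adding From and Date only to the copy."""
    view = list(header)  # the caller's list is never modified
    if not has_field(view, "From"):
        # A null or absent envelope sender falls back to the placeholder.
        view.append(("From", envelope_sender or placeholder_from))
    if not has_field(view, "Date"):
        date = date_from_received(view)
        if date is not None:
            view.append(("Date", date))
    return view
```

   A filter that treats a missing From or Date field as significant can
   still be given the original header, while modules that need a value
   to work with can be given the internal view.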

The synthesized fields should contain a best guess as to what should
have been there; for From, the SMTP MAIL command's address can be
used (if not null) or a placeholder address followed by an address
literal (for example, unknown@[192.0.2.1]); for Date, a date
extracted from a Received field is a reasonable choice.

One other important case to consider is a missing Message-ID field.
An MTA that encounters a message missing this field should synthesize
a valid one and add it to the external representation, since many
deployed tools use the content of that field as a common unique
message reference, so its absence inhibits correlation of message
processing. Section 3.6.4 of [MAIL] describes advisable practice for
synthesizing the content of this field when it is absent, and
establishes a requirement that it be globally unique.

7.5.3. Return-Path

A valid message will have exactly one Return-Path header field, as
per Section 4.4 of [SMTP]. Should a message be encountered bearing
more than one, all but the topmost one are to be disregarded, as the
topmost one is most likely to have been added nearest to the mailbox
that received that message.

7.6. Missing or Incorrect Charset Information

MIME provides the means to include textual material employing
character sets ("charsets") other than US-ASCII. Such material is
required to have an identified charset. Charset identification is
done using a "charset" parameter in the Content-Type header field or
a charset label within the MIME entity itself, or the charset can be
implicitly specified by the Content-Type (see [CHARSET]).

It is unfortunately fairly common for required character set
information to be missing or incorrect in textual MIME entities.
As such, processing agents should perform basic sanity checks, such
as:

o  US-ASCII contains bytes between 1 and 127 inclusive only
   (colloquially, "7-bit" data), so material including bytes outside
   of that range ("8-bit" data) is necessarily not US-ASCII. (See
   Section 2.3.1 of [MAIL].)

o  [UTF-8] has a very specific syntactic structure that other 8-bit
   charsets are unlikely to follow.

o  Null bytes (ASCII 0x00) are not allowed in either 7-bit or 8-bit
   data.

o  Not all 7-bit material is US-ASCII. The presence of the various
   escape sequences used for character switching can be used as an
   indication of the various charsets based on ISO/IEC 2022, such as
   those defined in [ISO-2022-CN], [ISO-2022-JP], and [ISO-2022-KR].

When a character set error is detected, processing agents should:

a.  apply heuristics to determine the most likely character set and,
    if successful, proceed using that information; or

b.  refuse to process the malformed MIME entity.

A null byte inside a textual MIME entity can cause typical string
processing functions to misidentify the end of a string, which can
be exploited to hide malicious content from analysis processes.
Accordingly, null bytes require additional special handling.

A few null bytes in isolation are likely to be the result of poor
message construction practices. Such nulls should be silently
dropped.

Large numbers of null bytes are usually the result of binary material
that is improperly encoded, improperly labeled, or both. Such
material is likely to be damaged beyond the hope of recovery, so the
best course of action is to refuse to process it.

Finally, the presence of null bytes may be used as an indication of
possible malicious intent.

7.7.
Eight-Bit Data

Standards-compliant email messages do not contain any non-ASCII data
without indicating that such content is present by means of published
SMTP extensions. Absent that, MIME encodings are typically used to
convert non-ASCII data to ASCII in a way that can be reversed by
other handling agents or end users.

The best way to handle non-compliant 8bit material depends on its
location.

Non-compliant 8bit material in MIME entity content should simply be
processed as if the necessary SMTP extensions had been used to
transfer the message. Note that improperly labeled 8bit material in
textual MIME entities may require treatment as described in
Section 7.6.

Non-compliant 8bit material in message or MIME entity header fields
can be handled as follows:

o  Occurrences in unstructured text fields, comments, and phrases
   can be converted into encoded-words (see [MIME3]) if a likely
   character set can be determined. Alternatively, 8bit characters
   can be removed or replaced with some other character.

o  Occurrences in header fields whose syntax is unknown may be
   handled by dropping the field entirely or by removing/replacing
   the 8bit character as described above.

o  Occurrences in addresses are especially problematic. Agents
   supporting [EAI] may, if the 8bit material conforms to 8bit
   syntax, elect to treat the message as an EAI message and process
   it accordingly. Otherwise, it is in most cases best to exclude
   the address from any sort of processing -- which may mean dropping
   it entirely -- since any attempt to fix it definitively is
   unlikely to be successful.

8. MIME Anomalies

The five-part set of MIME specifications includes a mechanism of
message extensions for providing text in character sets other than
ASCII, non-text attachments to messages, multi-part message bodies,
and similar facilities.
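
Returning briefly to Section 7.7, the handling suggested there for
8bit material in unstructured header fields might look roughly like
the following Python sketch. The function name repair_unstructured,
the choice to try UTF-8 when the caller supplies no likely charset,
and the use of "?" as the replacement character are assumptions made
for illustration; the encoded-word form itself is produced by the
standard library's email.header.Header class.

```python
from email.header import Header, decode_header, make_header
from typing import Optional


def repair_unstructured(value: bytes,
                        likely_charset: Optional[str] = None) -> str:
    """Return a 7-bit-clean rendering of an unstructured field value."""
    if value.isascii():
        return value.decode("ascii")
    # Assumption: try the caller's guess first, then UTF-8, whose
    # strict syntax makes a false match unlikely.
    for charset in filter(None, (likely_charset, "utf-8")):
        try:
            text = value.decode(charset)
        except (UnicodeDecodeError, LookupError):
            continue
        # A likely charset was found: emit encoded-word(s).
        return Header(text, charset).encode()
    # No likely charset: replace each 8bit byte with "?".
    return bytes(b if b < 0x80 else 0x3F for b in value).decode("ascii")


encoded = repair_unstructured("Caf\u00e9".encode("utf-8"))
replaced = repair_unstructured(b"caf\xe9")
guessed = repair_unstructured(b"caf\xe9", "iso-8859-1")
```

Address fields are deliberately out of scope for this sketch, since
Section 7.7 advises against attempting to repair them.
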

Some anomalies with MIME-compliant generation are also common. This
section discusses some of those and presents preferred mitigations.

8.1. Missing MIME-Version Field

Any message that uses [MIME] constructs is required to have a
MIME-Version header field. Without it, the Content-Type and
associated fields have no semantic meaning.

It is often observed that a message has complete MIME structure, yet
lacks this header field. It is prudent for agents, especially those
attempting to identify malicious material, to disregard this absence
and analyze the message as if the field were present.

Further, the absence of MIME-Version might be an indication of
malicious intent, and extra scrutiny of the message may be warranted.
Such omissions are not expected from compliant message generators.

8.2. Faulty Encodings

There have been a few different specifications of base64 in the past.
The specification in [MIME] instructs decoders to discard characters
that are not part of the base64 alphabet. Other implementations
consider an encoded body containing such characters to be completely
invalid. Very early specifications of base64 (see [PEM], for
example) allowed email-style comments within base64-encoded data.

The attack vector here involves constructing a base64 body whose
meaning varies given different possible decodings. If a security
analysis module wishes to be thorough, it should consider scanning
the possible outputs of the known decoding dialects in an attempt to
anticipate how the MUA will interpret the data.

9. Body Anomalies

9.1. Oversized Lines

A message containing a line of content that exceeds 998 characters
plus the line terminator (1000 total) violates Section 2.1.1 of
[MAIL]. Some handling agents may not look at content in a single
line past the first 998 bytes, providing bad actors an opportunity to
hide malicious content.
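
A minimal sketch of the length check, together with the
content-encoding rewrite that this section goes on to recommend,
could look like the following in Python. The names oversized_lines
and reencode_if_oversized are hypothetical, and the choice of
quoted-printable over base64 is simply an assumption for the
example; the encoding itself is done by the standard quopri module.

```python
import quopri

MAX_LINE_OCTETS = 998  # limit from Section 2.1.1 of [MAIL], excluding CRLF


def oversized_lines(body: bytes) -> list:
    """Return the 1-based numbers of lines longer than the limit."""
    return [number
            for number, line in enumerate(body.splitlines(), start=1)
            if len(line) > MAX_LINE_OCTETS]


def reencode_if_oversized(body: bytes):
    """Return (body, encoding); re-encode only if a line is too long."""
    if not oversized_lines(body):
        return body, None
    # Quoted-printable inserts soft line breaks, so the long line is
    # split in transit but restored intact on decoding.
    return quopri.encodestring(body), "quoted-printable"


body = b"short line\n" + b"A" * 2000 + b"\nhidden payload\n"
encoded, cte = reencode_if_oversized(body)
```

An agent applying this to a MIME part would also need to set the
part's Content-Transfer-Encoding field to the returned value, which
the sketch leaves to the caller.
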

There is no specified way to handle such messages, other than to
observe that they are non-compliant and reject them, or rewrite the
oversized line such that the message is compliant.

To ensure long lines do not prevent analysis of potentially malicious
data, handling agents are strongly encouraged to take one of the
following actions:

1.  Break such lines into multiple lines at a position that does not
    change the semantics of the text being thus altered. For
    example, a break that causes a [URI] to span two lines should be
    avoided, as it could inhibit the proper identification of that
    URI.

2.  Rewrite the MIME part (or the entire message if not MIME) that
    contains the excessively long line using a content encoding that
    breaks the line in the transmission but would still result in the
    line being intact on decoding for presentation to the user. Both
    of the encodings defined in [MIME] (quoted-printable and base64)
    can accomplish this.

10. Security Considerations

The discussions of the anomalies above and their prescribed solutions
are themselves security considerations. The practices enumerated in
this document are generally perceived as attempts to resolve security
considerations that already exist rather than introducing new ones.
However, some of the attacks described here may not have appeared in
previous email specifications.

11. IANA Considerations

This document contains no actions for IANA.

[RFC Editor: Please remove this section prior to publication.]

12. References

12.1. Normative References

[EMAIL-ARCH]  Crocker, D., "Internet Mail Architecture", RFC 5598,
              July 2009.

[MAIL]        Resnick, P., Ed., "Internet Message Format", RFC 5322,
              October 2008.

[MIME]        Freed, N. and N. Borenstein, "Multipurpose Internet
              Mail Extensions (MIME) Part One: Format of Internet
              Message Bodies", RFC 2045, November 1996.

12.2.
Informative References

[BINARYSMTP]  Vaudreuil, G., "SMTP Service Extensions for
              Transmission of Large and Binary MIME Messages",
              RFC 3030, December 2000.

[CHARSET]     Melnikov, A. and J. Reschke, "Update to MIME regarding
              "charset" Parameter Handling in Textual Media Types",
              RFC 6657, July 2012.

[DKIM]        Crocker, D., Ed., Hansen, T., Ed., and M. Kucherawy,
              Ed., "DomainKeys Identified Mail (DKIM) Signatures",
              RFC 6376, September 2011.

[DSN]         Moore, K. and G. Vaudreuil, "An Extensible Message
              Format for Delivery Status Notifications", RFC 3464,
              January 2003.

[EAI]         Yang, A., Steele, S., and N. Freed, "Internationalized
              Email Headers", RFC 6532, February 2012.

[ISO-2022-CN] Zhu, HF., Hu, DY., Wang, ZG., Kao, TC., Chang, WCH.,
              and M. Crispin, "Chinese Character Encoding for
              Internet Messages", RFC 1922, March 1996.

[ISO-2022-JP] Murai, J., Crispin, M., and E. van der Poel, "Japanese
              Character Encoding for Internet Messages", RFC 1468,
              June 1993.

[ISO-2022-KR] Choi, U., Chon, K., and H. Park, "Korean Character
              Encoding for Internet Messages", RFC 1557,
              December 1993.

[MIME3]       Moore, K., "MIME (Multipurpose Internet Mail
              Extensions) Part Three: Message Header Extensions for
              Non-ASCII Text", RFC 2047, November 1996.

[PEM]         Linn, J., "Privacy Enhancement for Internet Electronic
              Mail: Part I -- Message Encipherment and
              Authentication Procedures", RFC 1113, August 1989.

[RFC1122]     Braden, R., Ed., "Requirements for Internet Hosts --
              Communication Layers", RFC 1122, October 1989.

[RFC2822]     Resnick, P., Ed., "Internet Message Format", RFC 2822,
              April 2001.

[RFC733]      Crocker, D., Vittal, J., Pogran, K., and D. Henderson,
              Jr., "Standard for the Format of Internet Text
              Messages", RFC 733, November 1977.

[SMTP]        Klensin, J., "Simple Mail Transfer Protocol",
              RFC 5321, October 2008.

[URI]         Berners-Lee, T., Fielding, R., and L.
              Masinter, "Uniform Resource Identifier (URI): Generic
              Syntax", RFC 3986, January 2005.

[UTF-8]       Yergeau, F., "UTF-8, a transformation format of ISO
              10646", RFC 3629, November 2003.

Appendix A. RFC Editor Notes

[RFC Editor Note: This section can be removed before publication.]

I can't seem to figure out how to do this with xml2rfc, but the
ISO-2022 reference above should contain the following URI:
http://www.iso.org/iso/catalogue_detail.htm?csnumber=22747

Appendix B. Acknowledgements

The authors wish to acknowledge the following for their review and
constructive criticism of this proposal: Dave Cridland, Dave Crocker,
Jim Galvin, Tony Hansen, John Levine, Franck Martin, Alexey Melnikov,
and Timo Sirainen.

Authors' Addresses

Murray S. Kucherawy

EMail: superuser@gmail.com

Gregory N. Shapiro

EMail: gshapiro@proofpoint.com

N. Freed

EMail: ned.freed@mrochek.com