Network Working Group                                            F. Gont
Internet-Draft                                                   UK CPNI
Intended status: BCP                                   February 20, 2009
Expires: August 24, 2009


     Security Assessment of the Transmission Control Protocol (TCP)
                     draft-gont-tcp-security-00.txt

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with the
   provisions of BCP 78 and BCP 79.  This document may not be modified,
   and derivative works of it may not be created, except to format it
   for publication as an RFC and to translate it into languages other
   than English.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on August 24, 2009.

Copyright Notice

   Copyright (c) 2009 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.

Abstract

   This document contains a security assessment of the IETF
   specifications of the Transmission Control Protocol (TCP), and of a
   number of mechanisms and policies in use by popular TCP
   implementations.  It is based on the results of a project carried out
   by the UK's Centre for the Protection of National Infrastructure
   (CPNI).

Table of Contents

   1.  Preface
     1.1.  Introduction
     1.2.  Scope of this document
     1.3.  Organization of this document
   2.  The Transmission Control Protocol
   3.  TCP header fields
     3.1.  Source Port
       3.1.1.  Problems that may arise as a result of collisions
               of connection-id's
       3.1.2.  Port randomization algorithms
       3.1.3.  TCP ephemeral port range
     3.2.  Destination port
     3.3.  Sequence number
       3.3.1.  Generation of Initial Sequence Numbers
     3.4.  Acknowledgement Number
     3.5.  Data Offset
     3.6.  Control bits
       3.6.1.  Reserved (four bits)
       3.6.2.  CWR (Congestion Window Reduced)
       3.6.3.  ECE (ECN-Echo)
       3.6.4.  URG
       3.6.5.  ACK
       3.6.6.  PSH
       3.6.7.  RST
       3.6.8.  SYN
       3.6.9.  FIN
     3.7.  Window
       3.7.1.  Security implications of the maximum TCP window
               size
       3.7.2.  Security implications arising from closed windows
     3.8.  Checksum
     3.9.  Urgent pointer
       3.9.1.  Security implications arising from ambiguities in
               the processing of urgent indications
       3.9.2.  Security implications arising from the
               implementation of the urgent mechanism as "out of
               band" data
     3.10. Options
     3.11. Padding
     3.12. Data
   4.  Common TCP Options
     4.1.  End of Option List (Kind = 0)
     4.2.  No Operation (Kind = 1)
     4.3.  Maximum Segment Size (Kind = 2)
     4.4.  Selective Acknowledgement Option
       4.4.1.  SACK-permitted Option (Kind = 4)
       4.4.2.  SACK Option (Kind = 5)
     4.5.  MD5 Option (Kind=19)
     4.6.  Window scale option (Kind = 3)
     4.7.  Timestamps option (Kind = 8)
       4.7.1.  Generation of timestamps
       4.7.2.  Vulnerabilities
   5.  Connection-establishment mechanism
     5.1.  SYN flood
     5.2.  Connection forgery
     5.3.  Connection-flooding attack
       5.3.1.  Vulnerability
       5.3.2.  Countermeasures
     5.4.  Firewall-bypassing techniques
   6.  Connection-termination mechanism
     6.1.  FIN-WAIT-2 flooding attack
       6.1.1.  Vulnerability
       6.1.2.  Countermeasures
   7.  Buffer management
     7.1.  TCP retransmission buffer
       7.1.1.  Vulnerability
       7.1.2.  Countermeasures
     7.2.  TCP segment reassembly buffer
     7.3.  Automatic buffer tuning mechanisms
       7.3.1.  Automatic send-buffer tuning mechanisms
       7.3.2.  Automatic receive-buffer tuning mechanism
   8.  TCP segment reassembly algorithm
     8.1.  Problems that arise from ambiguity in the reassembly
           process
   9.  TCP Congestion Control
     9.1.  Congestion control with misbehaving receivers
       9.1.1.  ACK division
       9.1.2.  DupACK forgery
       9.1.3.  Optimistic ACKing
     9.2.  Blind DupACK triggering attacks against TCP
       9.2.1.  Blind throughput-reduction attack
       9.2.2.  Blind flooding attack
       9.2.3.  Difficulty in performing the attacks
       9.2.4.  Modifications to TCP's loss recovery algorithms
       9.2.5.  Countermeasures
     9.3.  TCP Explicit Congestion Notification (ECN)
       9.3.1.  Possible attacks by a compromised router
       9.3.2.  Possible attacks by a malicious TCP endpoint
   10. TCP API
     10.1. Passive opens and binding sockets
     10.2. Active opens and binding sockets
   11. Blind in-window attacks
     11.1. Blind TCP-based connection-reset attacks
       11.1.1. RST flag
       11.1.2. SYN flag
       11.1.3. Security/Compartment
       11.1.4. Precedence
       11.1.5. Illegal options
     11.2. Blind data-injection attacks
   12. Information leaking
     12.1. Remote Operating System detection via TCP/IP stack
           fingerprinting
       12.1.1. FIN probe
       12.1.2. Bogus flag test
       12.1.3. TCP ISN sampling
       12.1.4. TCP initial window
       12.1.5. RST sampling
       12.1.6. TCP options
       12.1.7. Retransmission Timeout (RTO) sampling
     12.2. System uptime detection
   13. Covert channels
   14. TCP Port scanning
     14.1. Traditional connect() scan
     14.2. SYN scan
     14.3. FIN, NULL, and XMAS scans
     14.4. Maimon scan
     14.5. Window scan
     14.6. ACK scan
   15. Processing of ICMP error messages by TCP
     15.1. Internet Control Message Protocol
       15.1.1. Internet Control Message Protocol for IP version 4
               (ICMP)
       15.1.2. Internet Control Message Protocol for IP version 6
               (ICMPv6)
     15.2. Handling of ICMP error messages
     15.3. Constraints in the possible solutions
     15.4. General countermeasures against ICMP attacks
       15.4.1. TCP sequence number checking
       15.4.2. Port randomization
       15.4.3. Filtering ICMP error messages based on the ICMP
               payload
     15.5. Blind connection-reset attack
       15.5.1. Description
       15.5.2. Attack-specific countermeasures
     15.6. Blind throughput-reduction attack
       15.6.1. Description
       15.6.2. Attack-specific countermeasures
     15.7. Blind performance-degrading attack
       15.7.1. Description
       15.7.2. Attack-specific countermeasures
   16. TCP interaction with the Internet Protocol (IP)
     16.1. TCP-based traceroute
     16.2. Blind TCP data injection through fragmented IP traffic
     16.3. Broadcast and multicast IP addresses
   17. Security Considerations
   18. Acknowledgements
   19. References
   Appendix A.  TODO list
   Appendix B.  Advice and guidance to vendors
   Author's Address

1.  Preface

1.1.  Introduction

   The TCP/IP protocol suite was conceived in an environment that was
   quite different from the hostile environment in which it currently
   operates.
   However, the effectiveness of the protocols led to their early
   adoption in production environments, to the point that, to some
   extent, the current world's economy depends on them.

   While many textbooks and articles have created the myth that the
   Internet protocols were designed for warfare environments, the
   top-level goal for the DARPA Internet Program was the sharing of
   large service machines on the ARPANET [Clark, 1988].  As a result,
   many protocol specifications focus only on the operational aspects of
   the protocols they specify, and overlook their security implications.

   Although Internet technology has evolved since its early inception,
   the Internet's building blocks are basically the same core protocols
   adopted by the ARPANET more than two decades ago.  During the last
   twenty years, many vulnerabilities have been identified in the TCP/IP
   stacks of a number of systems.  Some of them were based on flaws in
   specific protocol implementations, affecting only a limited number of
   systems, while others were based on flaws in the protocols
   themselves, affecting virtually every existing implementation
   [Bellovin, 1989].  Even in the last couple of years, researchers were
   still working on security problems in the core protocols
   [NISCC, 2004] [NISCC, 2005].

   The discovery of vulnerabilities in the TCP/IP protocol suite usually
   led to reports being published by a number of CSIRTs (Computer
   Security Incident Response Teams) and vendors, which helped to raise
   awareness about the threats and the best mitigations known at the
   time the reports were published.  Unfortunately, this also led to the
   documentation of the discovered protocol vulnerabilities being spread
   among a large number of documents, which are sometimes difficult to
   identify.

   For some reason, much of the effort of the security community on the
   Internet protocols did not result in official documents (RFCs) being
   issued by the IETF (Internet Engineering Task Force).  This basically
   led to a situation in which "known" security problems have not always
   been addressed by all vendors.  In addition, in many cases vendors
   have implemented quick "fixes" to the identified vulnerabilities
   without a careful analysis of their effectiveness and their impact on
   interoperability [Silbersack, 2005].

   Producing a secure TCP/IP implementation nowadays is a very difficult
   task, in part because of the lack of a single document that serves as
   a security roadmap for the protocols.  Implementers are faced with
   the hard task of identifying relevant documentation and
   differentiating between documentation that provides correct advice
   and documentation that provides misleading advice based on inaccurate
   or wrong assumptions.

   There is a clear need for a companion document to the IETF
   specifications that discusses the security aspects and implications
   of the protocols, identifies the existing vulnerabilities, discusses
   the possible countermeasures, and analyzes their respective
   effectiveness.

   This document is the result of a security assessment of the IETF
   specifications of the Transmission Control Protocol (TCP).  Possible
   threats are identified and, where possible, countermeasures are
   proposed.  Additionally, many implementation flaws that have led to
   security vulnerabilities have been referenced in the hope that future
   implementations will not incur the same problems.

   This document does not aim to be the final word on the security
   aspects of TCP.
   On the contrary, it aims to raise awareness about a number of TCP
   vulnerabilities that have been faced in the past, those that are
   currently being faced, and some of those that we may still have to
   deal with in the future.

   Feedback from the community is more than encouraged to help this
   document be as accurate as possible and to keep it updated as new
   vulnerabilities are discovered.

   This document is heavily based on the "Security Assessment of the
   Transmission Control Protocol (TCP)" released by the UK Centre for
   the Protection of National Infrastructure (CPNI), available at:
   <http://www.cpni.gov.uk/Products/technicalnotes/Feb-09-security-assessment-TCP.aspx>.

1.2.  Scope of this document

   While there are a number of protocols that may affect the way TCP
   operates, this document focuses only on the specifications of the
   Transmission Control Protocol (TCP) itself.

   The following IETF RFCs were selected for assessment as part of this
   work:

   o  RFC 793, "Transmission Control Protocol. DARPA Internet Program.
      Protocol Specification" (91 pages)

   o  RFC 1122, "Requirements for Internet Hosts -- Communication
      Layers" (116 pages)

   o  RFC 1191, "Path MTU Discovery" (19 pages)

   o  RFC 1323, "TCP Extensions for High Performance" (37 pages)

   o  RFC 1948, "Defending Against Sequence Number Attacks" (6 pages)

   o  RFC 1981, "Path MTU Discovery for IP version 6" (15 pages)

   o  RFC 2018, "TCP Selective Acknowledgment Options" (12 pages)

   o  RFC 2385, "Protection of BGP Sessions via the TCP MD5 Signature
      Option" (6 pages)

   o  RFC 2581, "TCP Congestion Control" (14 pages)

   o  RFC 2675, "IPv6 Jumbograms" (9 pages)

   o  RFC 2883, "An Extension to the Selective Acknowledgement (SACK)
      Option for TCP" (17 pages)

   o  RFC 2884, "Performance Evaluation of Explicit Congestion
      Notification (ECN) in IP Networks" (18 pages)

   o  RFC 2988, "Computing TCP's Retransmission Timer" (8 pages)

   o  RFC 3168, "The Addition of Explicit Congestion Notification (ECN)
      to IP" (63 pages)

   o  RFC 3465, "TCP Congestion Control with Appropriate Byte Counting
      (ABC)" (10 pages)

   o  RFC 3517, "A Conservative Selective Acknowledgment (SACK)-based
      Loss Recovery Algorithm for TCP" (13 pages)

   o  RFC 3540, "Robust Explicit Congestion Notification (ECN) Signaling
      with Nonces" (13 pages)

   o  RFC 3782, "The NewReno Modification to TCP's Fast Recovery
      Algorithm" (19 pages)

1.3.  Organization of this document

   This document is organized in two parts.  The first part contains a
   discussion of each of the TCP header fields, identifies their
   security implications, and discusses the possible countermeasures.
   The second part contains an analysis of the security implications of
   the mechanisms and policies implemented by TCP, and of a number of
   implementation strategies in use by popular TCP implementations.

2.  The Transmission Control Protocol

   The Transmission Control Protocol (TCP) is a connection-oriented
   transport protocol that provides a reliable byte-stream data transfer
   service.

   Very few assumptions are made about the reliability of the underlying
   data transfer services below the TCP layer.  Basically, TCP assumes
   it can obtain a simple, potentially unreliable datagram service from
   the lower-level protocols.  Figure 1 illustrates where TCP fits in
   the DARPA reference model.

                         +---------------+
                         |  Application  |
                         +---------------+
                         |      TCP      |
                         +---------------+
                         |      IP       |
                         +---------------+
                         |    Network    |
                         +---------------+

               Figure 1: TCP in the DARPA reference model

   TCP provides facilities in the following areas:

   o  Basic Data Transfer

   o  Reliability

   o  Flow Control

   o  Multiplexing

   o  Connections

   o  Precedence and Security

   o  Congestion Control

   The core TCP specification, RFC 793 [Postel, 1981c], dates back to
   1981 and standardizes the basic mechanisms and policies of TCP.
   RFC 1122 [Braden, 1989] provides clarifications and errata for the
   original specification.  RFC 2581 [Allman et al, 1999] specifies TCP
   congestion control and avoidance mechanisms, which were not present
   in the original specification.  Other documents specify extensions
   and improvements for TCP.

   The large number of documents that specify extensions, improvements,
   or modifications to existing TCP mechanisms has led the IETF to
   publish a roadmap for TCP, RFC 4614 [Duke et al, 2006], that
   clarifies the relevance of each of those documents.

3.  TCP header fields

   RFC 793 [Postel, 1981c] defines the syntax of a TCP segment, along
   with the semantics of each of the header fields.  Figure 2
   illustrates the syntax of a TCP segment.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |          Source Port          |       Destination Port        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        Sequence Number                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Acknowledgment Number                      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Data |       |C|E|U|A|P|R|S|F|                               |
   | Offset|Resrved|W|C|R|C|S|S|Y|I|            Window             |
   |       |       |R|E|G|K|H|T|N|N|                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           Checksum            |         Urgent Pointer        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Options                    |    Padding    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                             data                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

          Note that one tick mark represents one bit position

          Figure 2: Transmission Control Protocol header format

   The minimum TCP header size is 20 bytes, and corresponds to a TCP
   segment with no options and no data.  However, a TCP module might be
   handed an (illegitimate) "TCP segment" of less than 20 bytes.
   Therefore, before doing any processing of the TCP header fields, the
   following check should be performed by TCP on the segments handed to
   it by the internet layer:

      Segment.Size >= 20

   If a segment does not pass this check, it should be dropped.

   The following subsections contain further sanity checks that should
   be performed on TCP segments.

3.1.  Source Port

   This field contains a 16-bit number that identifies the TCP end-point
   that originated this TCP segment.  Being a 16-bit field, it can
   contain any value in the range 0-65535.
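
   The minimum-size check above, followed by extraction of the 16-bit
   Source Port (and Destination Port) fields, might look like the
   Python sketch below.  It is illustrative only.  `parse_ports()` is a
   hypothetical helper.  The sketch assumes that the TCP segment handed
   up by the internet layer is available as a raw `bytes` object in
   network byte order.  The sample header is arbitrary test data.

```python
import struct
from typing import Optional, Tuple

MIN_TCP_HEADER_LEN = 20  # bytes: a TCP header with no options and no data


def parse_ports(segment: bytes) -> Optional[Tuple[int, int]]:
    """Return (source_port, destination_port), or None if the segment
    fails the Segment.Size >= 20 check and should be dropped.

    `segment` is assumed to be the raw TCP segment (header plus data);
    parse_ports() is a hypothetical helper, not an existing API.
    """
    if len(segment) < MIN_TCP_HEADER_LEN:
        return None  # too short to hold even a minimal TCP header
    # Both ports are 16-bit unsigned integers in network byte order ("!HH"),
    # so each value necessarily falls in the range 0-65535.
    return struct.unpack_from("!HH", segment, 0)


# Arbitrary 20-byte header: ports 49152 -> 80, Data Offset = 5, SYN set.
hdr = struct.pack("!HHIIHHHH", 49152, 80, 0, 0, (5 << 12) | 0x02, 65535, 0, 0)
print(parse_ports(hdr))       # (49152, 80)
print(parse_ports(hdr[:19]))  # None: fails the minimum-size check
```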

   The Internet Assigned Numbers Authority (IANA) has traditionally
   reserved the following use of the 16-bit port range of TCP
   [IANA, 2008]:

   o  The Well Known Ports, 0 through 1023

   o  The Registered Ports, 1024 through 49151

   o  The Dynamic and/or Private Ports, 49152 through 65535

   The range of assigned ports managed by the IANA is 0-1023, with the
   remainder being registered by IANA but not assigned [IANA, 2008].  It
   is also worth noting that, while some systems restrict use of the
   port numbers in the range 0-1023 to privileged users, no trust should
   be granted based on the port numbers used for a TCP connection.

   Servers usually bind specific ports on which specific services are
   provided, while clients usually use so-called "ephemeral ports" as
   the source port of their outgoing connections.  The only requirement
   is that the resulting four-tuple be unique (i.e., not currently in
   use by any other transport protocol instance).

   While the only requirement for a selected ephemeral port is that the
   resulting four-tuple (connection-id) be unique, in practice it may be
   necessary to prevent port numbers that are in use by a TCP in the
   LISTEN or CLOSED states from being allocated as ephemeral ports.
   Otherwise, an attacker might be able to "steal" incoming connections
   from a local server application.  Section 10.2 of this document
   provides a detailed discussion of this issue.

   It should also be noted that some clients, such as DNS resolvers, are
   known to use port numbers from the "Well Known Ports" range.
   Therefore, middle-boxes such as packet filters should not assume that
   clients use port numbers only from the Dynamic or Registered port
   ranges.

   While port 0 is a legitimate port number, it has a special meaning in
   the UNIX Sockets API.
For example, when a TCP port number of 0 is passed as an argument to the bind() function, rather than binding port 0, an ephemeral port is selected for the corresponding TCP end-point. As a result, the TCP port number 0 is never actually used in TCP segments.

Different implementations have been found to respond differently to TCP segments that have a port number of 0 as the Source Port and/or the Destination Port. As a result, TCP segments with a port number of 0 are commonly employed for remote OS detection via TCP/IP stack fingerprinting [Jones, 2003].

Since in practice TCP port 0 is not used by any legitimate application and is only used for fingerprinting purposes, a number of host implementations already reject TCP segments that use 0 as the Source Port and/or the Destination Port. Also, a number of firewalls filter (by default) any TCP segments that contain a port number of zero for the Source Port and/or the Destination Port.

We therefore recommend that TCP implementations respond to incoming TCP segments that have a Source Port of 0 with an RST (provided these incoming segments do not have the RST bit set).

Responding with an RST segment to incoming segments that have the RST bit set would open the door to RST-war attacks.

As discussed in Section 3.2, we also recommend that TCP implementations respond with an RST to incoming packets that have a Destination Port of 0 (provided these incoming segments do not have the RST bit set).

3.1.1. Problems that may arise as a result of collisions of connection-id's

A number of implementations will not allow the creation of a new connection if there exists a previous incarnation of the same connection in any state other than the fictional state CLOSED.
This can be problematic in scenarios in which a client establishes connections with a specific service at a particular server at a high rate: even if the connections are also closed at a high rate, one of the systems (the one performing the active close) will keep each of the closed connections in the TIME-WAIT state for 2*MSL.

MSL (Maximum Segment Lifetime) is the maximum amount of time that a TCP segment can exist in an internet. It is defined to be 2 minutes by RFC 793 [Postel, 1981c], so that, with this value, a closed connection remains in the TIME-WAIT state for 2*MSL = 4 minutes.

If the connection rate is high enough, at some point all the ephemeral ports at the client will be in use by some connection in the TIME-WAIT state, thus preventing the establishment of new connections. In order to overcome this problem, a number of TCP implementations include some heuristics to allow the creation of a new incarnation of a connection that is in the TIME-WAIT state. In such implementations a new incarnation of a previous connection is allowed if:

o  The incoming SYN segment contains a timestamp option, and the timestamp is greater than the last timestamp seen in the previous incarnation of the connection (for that direction of the data transfer), or,

o  The incoming SYN segment does not contain a timestamp option, but its Initial Sequence Number (ISN) is greater than the last sequence number seen in the previous incarnation of the connection (for that direction of the data transfer).

Unfortunately, these heuristics are optional, and thus cannot be relied upon. Additionally, as indicated by [Silbersack, 2005], if the Timestamp or the ISN are trivially randomized, these heuristics might fail.

Section 3.3.1 and Section 4.7.1 of this document recommend algorithms for the generation of TCP Initial Sequence Numbers and TCP timestamps, respectively, that provide randomization, while still allowing the aforementioned heuristics to work.
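The two TIME-WAIT heuristics listed above can be sketched in C as follows. This is an illustrative sketch under stated assumptions: the structures timewait_state and syn_info and the function names are hypothetical, and "greater than" is implemented as a modulo-2^32 comparison, in the style that BSD-derived stacks use for sequence numbers, so that the check keeps working when the counters wrap around.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values remembered from the previous incarnation of the connection
   (hypothetical structure; field names are illustrative). */
struct timewait_state {
    uint32_t last_ts;   /* last TSval seen from the peer */
    uint32_t last_seq;  /* last sequence number seen from the peer */
};

/* Hypothetical summary of an incoming SYN segment. */
struct syn_info {
    bool     has_ts;    /* SYN carries a Timestamps option */
    uint32_t ts_val;    /* its TSval, if present */
    uint32_t isn;       /* the SYN's sequence number (the ISN) */
};

/* "a is greater than b" in modulo-2^32 arithmetic. */
static bool seq_gt(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) > 0;
}

/* Returns true if the heuristics allow a new incarnation of a
   connection that is currently in the TIME-WAIT state. */
bool timewait_may_reuse(const struct timewait_state *tw,
                        const struct syn_info *syn)
{
    if (syn->has_ts)                       /* first heuristic */
        return seq_gt(syn->ts_val, tw->last_ts);
    return seq_gt(syn->isn, tw->last_seq); /* second heuristic */
}
```

The sketch also shows why trivially randomized values defeat these heuristics: a uniformly random ISN or timestamp compares as "greater" than the previous value only about half of the time, so roughly half of the connection attempts would be refused.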
Therefore, the only strategy that can be relied upon to avoid this interoperability problem is to minimize the rate of collisions of connection-id's. A good algorithm to minimize the rate of collisions of connection-id's would consider the time a given four-tuple {Source Address, Source Port, Destination Address, Destination Port} was last used, and would try to avoid reusing it for 2*MSL. However, an efficient implementation approach for this algorithm has not yet been devised. A simple approach to minimize the rate of collisions of connection-id's in most scenarios is to maximize the port reuse cycle, such that a port number is not reused before all the other port numbers in the ephemeral port range have been used for outgoing connections. This is the traditional ephemeral port selection algorithm in 4.4BSD implementations.

However, if a single global variable is used to keep track of the last ephemeral port selected, ephemeral port numbers become trivially predictable.

Section 3.1.2 of this document analyzes a number of approaches for obfuscating the TCP ephemeral ports, such that an attacker's chances of guessing the ephemeral ports used for future connections are reduced, while still reducing the probability of collisions of connection-id's. Finally, Section 3.1.3 makes recommendations about the port range that should be used for the ephemeral ports.

3.1.2. Port randomization algorithms

Since most "blind" attacks against TCP require the attacker to guess or know the four-tuple that identifies the TCP connection to be attacked [Gont, 2008a] [Touch, 2007] [Watson, 2004], obfuscating this four-tuple forces an off-path attacker, in a number of scenarios, to do much more work to successfully perform any of these attacks against a TCP connection.
Therefore, we recommend that TCP implementations randomize their ephemeral ports.

There are a number of factors to consider when designing a policy of selection of ephemeral ports, which include:

o  Minimizing the predictability of the ephemeral port numbers used for future connections

o  Minimizing the rate of collisions of connection-id's

o  Avoiding conflicts with applications that depend on the use of specific port numbers

Given the goal of improving TCP's resistance to attack by obfuscation of the four-tuple that identifies a TCP connection, it is key to minimize the predictability of the ephemeral ports that will be selected for new connections. While the obvious approach to address this requirement would be to select the ephemeral ports by simply picking a random value within the chosen ephemeral port number range, this straightforward policy may lead to a short reuse cycle of port numbers, which could lead to the interoperability problems discussed in [Silbersack, 2005].

It is also worth noting that, provided adequate randomization algorithms are in use, the larger the range from which ephemeral ports are selected, the smaller an attacker's chances of guessing the selected port number. This is discussed in Section 3.1.3 of this document.

[Larsen and Gont, 2008] provides a detailed discussion of a number of algorithms for obfuscating the ephemeral ports. The properties of these algorithms have been empirically analyzed in [Allman, 2008].

[Larsen and Gont, 2008] recently suggested an approach that is meant to comply with the requirements stated above, which resembles the proposal in RFC 1948 [Bellovin, 1996] for selecting TCP Initial Sequence Numbers.
Basically, it proposes to give each triple {Source Address, Destination Address, Destination Port} a separate port number space, by selecting ephemeral ports by means of an expression of the form:

   port = min_port + (counter + F()) % (max_port - min_port + 1)

       Equation 1: Simple hash-based ephemeral port selection algorithm

where:

   port
      Ephemeral port number selected for this connection

   min_port
      Lower limit of the ephemeral port number space

   max_port
      Upper limit of the ephemeral port number space

   counter
      A variable that is initialized to some arbitrary value, and is incremented once for each port number that is selected

   F()
      A hash function that should take as input both the local and remote IP addresses, the TCP destination port, and a secret key. The result of F should not be computable without the knowledge of all the parameters of the hash function

The hash function F() separates the port number space for each triple {Source Address, Destination Address, Destination Port} by providing an "offset" in the port number space that is unique (assuming no hash collisions) for each triple. As a result, subsequent connections to the same end-point would be assigned incremental port numbers, thus maximizing the port reuse cycle while still making it difficult for an attacker to guess the selected ephemeral port number used for connections with other endpoints.

Keeping track of the last ephemeral port selected for each of the possible values of F() would require a considerable amount of system memory. Therefore, a possible approach would be to keep a global counter variable, which would reduce the required system memory at the expense of a shorter port reuse cycle.
This latter approach would have the same port reuse properties as the widely implemented approach of selecting ephemeral port numbers incrementally (without randomization), while still reducing the predictability of ephemeral port numbers used for connections with other endpoints. Figure 3 shows this algorithm in pseudo-code.

   /* Initialization code at system boot time.
      Initialization value could be random. */
   next_ephemeral = 0;

   /* Ephemeral port selection function */
   num_ephemeral = max_ephemeral - min_ephemeral + 1;
   offset = F(local_IP, remote_IP, remote_port, secret_key);
   count = num_ephemeral;

   do {
       port = min_ephemeral + (next_ephemeral + offset) % num_ephemeral;
       next_ephemeral++;

       if (five-tuple is unique)
           return port;

       count--;

   } while (count > 0);

   return ERROR;

        Figure 3: Simple hash-based ephemeral port selection algorithm

An analysis of a sample scenario can help to understand how this algorithm works. Table 1 illustrates, for a number of consecutive connection requests, some possible values for each of the variables used in this ephemeral port selection algorithm. Additionally, the table shows the result of the port selection function. The values shown correspond to min_port = 1024: each selected port equals 1024 + counter + offset, where counter is the next_ephemeral variable of Figure 3.

   +-----+-----------------+--------+---------+------+
   | Nr. | IP address:port | offset | counter | port |
   +-----+-----------------+--------+---------+------+
   | #1  | 10.0.0.1:80     | 1000   | 1024    | 3048 |
   +-----+-----------------+--------+---------+------+
   | #2  | 10.0.0.1:80     | 1000   | 1025    | 3049 |
   +-----+-----------------+--------+---------+------+
   | #3  | 192.168.0.1:80  | 4500   | 1026    | 6550 |
   +-----+-----------------+--------+---------+------+
   | #4  | 192.168.0.1:80  | 4500   | 1027    | 6551 |
   +-----+-----------------+--------+---------+------+
   | #5  | 10.0.0.1:80     | 1000   | 1028    | 3052 |
   +-----+-----------------+--------+---------+------+

   Table 1: Sample scenario for a simple hash-based port randomization algorithm

The first two entries of the table illustrate the contents of each of the variables when two ephemeral ports are selected to establish two consecutive connections to the same remote end-point {10.0.0.1, 80}. The two ephemeral ports that get selected belong to the same port number "sequence", since the result of the hash function F() is the same in both cases. The third and fourth entries of the table illustrate the contents of each of the variables when the algorithm later selects two ephemeral ports to establish two consecutive connections to the remote end-point {192.168.0.1, 80}. The result of F() is the same for these two cases, and thus the two ephemeral ports that get selected belong to the same "sequence". However, this sequence is different from that of the first two port numbers selected before, as the value of F() is different from that obtained when those two port numbers (#1 and #2) were selected earlier. Finally, in entry #5 another ephemeral port is selected to connect to the same end-point as in entries #1 and #2. We note that the selected port number belongs to the same sequence as the first two port numbers selected (#1 and #2), but that two ports of that sequence (3050 and 3051) have been skipped. This is the consequence of having a single global counter variable that gets incremented whenever a port number is selected.
When counter is incremented as a result of the port selections #3 and #4, this causes two ports (3050 and 3051) in all the other port number sequences to be "skipped", unnecessarily.

[Larsen and Gont, 2008] describes an improvement to this algorithm, in which a value derived from the three-tuple {Source Address, Destination Address, Destination Port} is used as an index into an array of "counter" variables, which would be used in the equation described above. The rationale of this approach is that the selection of an ephemeral port number for a given three-tuple {Source Address, Destination Address, Destination Port} should not necessarily cause the counter variables corresponding to other three-tuples to be incremented. Figure 4 illustrates this improved algorithm in pseudo-code.

   /* Initialization at system boot time */
   for (i = 0; i < TABLE_LENGTH; i++)
       table[i] = random() % 65536;

   /* Ephemeral port selection function */
   num_ephemeral = max_ephemeral - min_ephemeral + 1;
   offset = F(local_IP, remote_IP, remote_port, secret_key);
   index = G(offset);
   count = num_ephemeral;

   do {
       port = min_ephemeral + (offset + table[index]) % num_ephemeral;
       table[index]++;

       if (five-tuple is unique)
           return port;

       count--;

   } while (count > 0);

   return ERROR;

        Figure 4: Double hash-based ephemeral port selection algorithm

Table 2 illustrates a possible result for the same sequence of events as those in Table 1 (again with min_port = 1024), along with the values for each of the involved variables.

   +-----+-----------------+--------+-------+--------------+------+
   | Nr. | IP address:port | offset | index | table[index] | port |
   +-----+-----------------+--------+-------+--------------+------+
   | #1  | 10.0.0.1:80     | 1000   | 10    | 1024         | 3048 |
   +-----+-----------------+--------+-------+--------------+------+
   | #2  | 10.0.0.1:80     | 1000   | 10    | 1025         | 3049 |
   +-----+-----------------+--------+-------+--------------+------+
   | #3  | 192.168.0.1:80  | 4500   | 15    | 1024         | 6548 |
   +-----+-----------------+--------+-------+--------------+------+
   | #4  | 192.168.0.1:80  | 4500   | 15    | 1025         | 6549 |
   +-----+-----------------+--------+-------+--------------+------+
   | #5  | 10.0.0.1:80     | 1000   | 10    | 1026         | 3050 |
   +-----+-----------------+--------+-------+--------------+------+

   Table 2: Sample scenario for a double hash-based port randomization algorithm

The table illustrates that the destination end-points "10.0.0.1:80" and "192.168.0.1:80" result in different values for index, and therefore increments in one of the port number sequences do not affect the other sequences, thus maximizing the port reuse cycle.

We recommend the implementation of the ephemeral port selection algorithm illustrated in Figure 4.

3.1.3. TCP ephemeral port range

We recommend that TCP select ephemeral ports from the range 1024-65535 (i.e., set the min_port and max_port variables of the previous section to 1024 and 65535, respectively), which contains 64512 port numbers. This maximizes the port number space from which the ephemeral ports are selected, while intentionally excluding the port numbers in the range 0-1023, which in UNIX systems have traditionally required super-user privileges to bind.

4.4BSD implementations have traditionally chosen ephemeral ports from the range 1024-5000 (only 3977 port numbers), thus greatly increasing an attacker's chances of guessing the selected port number [Wright and Stevens, 1994].
Unfortunately, most current implementations are still using a small range of the whole port number space, such as 1024-49151 or 49152-65535.

It is important to note that a number of applications rely on binding specific port numbers that may be within the ephemeral port range. If such an application were run while the corresponding port number was in use, the application would fail.

This problem does not arise from port randomization itself, and has actually been experienced by users of popular TCP implementations that do not randomize their ephemeral ports.

A solution to this potential problem would be to maintain a list of port numbers that are usually needed for running popular applications. If the port number selected by Equation 1 were in such a list, the next available port number would be selected instead. This "list" of port numbers could be implemented as an array of bits, in which each bit would correspond to one of the 65536 TCP port numbers, with a value of 0 (zero) meaning that the corresponding TCP port is available for allocation as an ephemeral port, and a value of 1 (one) meaning that the corresponding port number should not be allocated as an ephemeral port. The specification of which ports should be "reserved" for applications may depend on the underlying operating system, and is out of the scope of this document.

As discussed in Section 3.1 and Section 10.2, in practice it may be necessary to disallow the allocation, as ephemeral ports, of port numbers that are currently in use by a TCP in the LISTEN or CLOSED states, as this might allow an attacker to "steal" incoming connections from a local server application. Section 10.2 of this document provides a detailed discussion of this issue.

3.2. Destination port

This field contains the destination TCP port of this segment.
Being a 16-bit value, it can contain any value in the range 0-65535. While some systems restrict use of the port numbers in the range 0-1023 to privileged users, no trust should be granted based on the port numbers in use for a connection.

As noted in Section 3.1 of this document, while port 0 is a legitimate port number, it has a special meaning in the UNIX Sockets API. For example, when a TCP port number of 0 is passed as an argument to the bind() function, rather than binding port 0, an ephemeral port is selected for the corresponding TCP end-point. As a result, the TCP port number 0 is never actually used in TCP segments.

Different implementations have been found to respond differently to TCP segments that have a port number of 0 as the Source Port and/or the Destination Port. As a result, TCP segments with a port number of 0 are commonly employed for remote OS detection via TCP/IP stack fingerprinting [Jones, 2003].

Since in practice TCP port 0 is not used by any legitimate application and is only used for fingerprinting purposes, a number of host implementations already reject TCP segments that use 0 as the Source Port and/or the Destination Port. Also, a number of firewalls filter (by default) any TCP segments that contain a port number of zero for the Source Port and/or the Destination Port.

We therefore recommend that TCP implementations respond to incoming TCP segments that have a Destination Port of 0 with an RST (provided these incoming segments do not have the RST bit set).

Responding with an RST segment to incoming packets that have the RST bit set would open the door to RST-war attacks.

Some systems have been found to be unable to process TCP segments in which the source end-point {Source Address, Source Port} is the same as the destination end-point {Destination Address, Destination Port}.
Such TCP segments have been reported to cause malfunction of a number of implementations [CERT, 1996], and have been exploited in the past to perform Denial of Service (DoS) attacks [Meltman, 1997]. While these packets are very unlikely to exist in real and legitimate scenarios, TCP should nevertheless be able to process them without the need for any "extra" code.

A SYN segment in which the source end-point {Source Address, Source Port} is the same as the destination end-point {Destination Address, Destination Port} will result in a "simultaneous open" scenario, such as the one described on page 32 of RFC 793 [Postel, 1981c]. Therefore, those TCP implementations that correctly handle simultaneous opens should already be prepared to handle these unusual TCP segments.

3.3. Sequence number

This field contains the sequence number of the first data octet in this segment. If the SYN flag is set, the sequence number is the Initial Sequence Number (ISN) of the connection, and the first data octet has the sequence number ISN+1.

3.3.1. Generation of Initial Sequence Numbers

The choice of the Initial Sequence Number of a connection is not arbitrary, but aims to minimize the chances of a stale segment being accepted by a new incarnation of a previous connection. RFC 793 [Postel, 1981c] suggests the use of a global 32-bit ISN generator, whose low-order bit is incremented roughly every 4 microseconds.

However, use of such an ISN generator makes it trivial to predict the ISN that a TCP will use for new connections, thus allowing a variety of attacks against TCP, such as those described in Section 5.2 and Section 11 of this document. This vulnerability was first described in [Morris, 1985], and its exploitation was widely publicized about 10 years later [Shimomura, 1995].
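To illustrate why such a generator is predictable, the following C sketch models an RFC 793-style clock and the extrapolation an attacker could perform. It is a simplified model, not the code of any particular implementation: the clock source (microseconds since boot) and both function names are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Model of an RFC 793-style global ISN generator: a 32-bit counter
   whose low-order bit is incremented every 4 microseconds. The clock
   source (microseconds since boot) is an assumption of this sketch. */
uint32_t rfc793_isn(uint64_t usec_since_boot)
{
    return (uint32_t)(usec_since_boot / 4);  /* wraps modulo 2^32 */
}

/* An attacker who has learned one ISN (e.g., from the SYN/ACK of a
   connection of its own) extrapolates the ISN that will be used
   dt_usec later. In this model the guess is either exact or one
   tick too low, because of the rounding of the two divisions. */
uint32_t predict_isn(uint32_t observed_isn, uint64_t dt_usec)
{
    return observed_isn + (uint32_t)(dt_usec / 4);
}
```

In a real attack, network delay and any other events that advance the counter widen the guess from one or two values to a small window of candidates, which is still far smaller than the 2^32 possibilities that an unpredictable ISN would leave.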
As a matter of fact, protection against old stale segments from a previous incarnation of the connection comes from allowing the creation of a new incarnation of a previous connection only after 2*MSL have passed since a segment corresponding to the old incarnation was last seen. This is accomplished by the TIME-WAIT state and TCP's "quiet time" concept. However, as discussed in Section 3.1 and Section 11.1.2 of this document, the ISN can be used to perform some heuristics meant to avoid an interoperability problem that may arise when two systems establish connections at a high rate. In order for such heuristics to work, the ISNs generated by a TCP should be monotonically increasing.

RFC 1948 [Bellovin, 1996] proposed a scheme that greatly reduces an attacker's chances of guessing the ISN of a TCP, while still producing, for each connection-id, a monotonically increasing sequence that allows implementation of the optimization described in Section 3.1 and Section 11.1.2 of this document. Basically, the document proposes to compute the ISN of a new connection as a result of the expression:

   ISN = M + F(localhost, localport, remotehost, remoteport, secret_key)

where M is a monotonically increasing counter maintained within TCP, and F() is a hash function. As it is vital that F() not be computable from the outside, RFC 1948 [Bellovin, 1996] suggests that it be a cryptographic hash function of the connection-id and some secret data.

RFC 1948 [Bellovin, 1996] proposes that F() be an MD5 hash function applied to the connection-id and some secret data. While there have been concerns regarding the properties of MD5 as a hash function, in this case it is simply used for obfuscating the ISN, rather than for signing the data contained in the TCP segments.
While the MD5 function could be replaced by a more secure hash function, by the time this issue becomes a concern, proper authentication mechanisms such as IPsec [Kent and Seo, 2005] should be considered for protecting the corresponding TCP connection.

[CERT, 2001] and [US-CERT, 2001] are advisories about the security implications of weak ISN generators. [Zalewski, 2001a] and [Zalewski, 2002] contain a detailed analysis of ISN generators, and a survey of the algorithms in use by popular TCP implementations.

Finally, another security consideration that should be made about TCP sequence numbers is that they might allow an attacker to count the number of systems behind a Network Address Translator (NAT) [Srisuresh and Egevang, 2001]. Depending on the ISN generators implemented by each of the systems behind the NAT, an attacker might be able to count the number of systems behind the NAT by establishing a number of TCP connections (using the public address of the NAT) and identifying the number of different sequence number "spaces". This information leakage could be eliminated by rewriting, at the NAT, the contents of all those header fields and options that make use of sequence numbers (such as the Sequence Number and the Acknowledgement Number fields, and the SACK Option). [Gont and Srisuresh, 2008] provides a detailed discussion of the security implications of NATs and of the possible mitigations for this and other issues.

3.4. Acknowledgement Number

If the ACK bit is on, the Acknowledgement Number contains the value of the next sequence number the sender of this segment is expecting to receive. According to RFC 793, the Acknowledgement Number is considered valid as long as it does not acknowledge the receipt of data that has not yet been sent.
That is, the following expression must be true:

   SEG.ACK <= SND.NXT

As a result of recent concerns about forgery attacks against TCP (see Section 11 of this document), ongoing work at the IETF [Ramaiah et al, 2008] has proposed to enforce a stricter check on the Acknowledgement Number. The following check should be enforced on segments that have the ACK bit set:

   SND.UNA - SND.MAX.WND <= SEG.ACK <= SND.NXT

where SND.MAX.WND is the largest window that the local TCP has ever received from its peer. If a TCP segment does not pass this check, the segment should be dropped, and an ACK segment should be sent in response.

If the ACK bit is off, the Acknowledgement Number field is not valid. We recommend that TCP implementations set the Acknowledgement Number to zero when sending a TCP segment that does not have the ACK bit set (e.g., an initial SYN segment).

Some TCP implementations have been known to fail to set the Acknowledgement Number to zero, thus leaking information.

TCP Acknowledgements are also used to perform heuristics for loss recovery and congestion control. Section 9 of this document describes a number of ways in which these mechanisms can be exploited.

3.5. Data Offset

The Data Offset field indicates the length of the TCP header in 32-bit words. As the minimum TCP header size is 20 bytes, the minimum legal value for this field is 5. Therefore, the following check should be enforced:

   Data Offset >= 5

For obvious reasons, the TCP header cannot be larger than the whole TCP segment it is part of. Therefore, the following check should be enforced:

   Data Offset * 4 <= TCP segment length

The TCP segment length should be obtained from the IP layer, as TCP does not include a TCP segment length field.

3.6. Control bits

The following subsections provide a discussion of the different control bits in the TCP header.
TCP segments with unusual combinations of flags set have been known in the past to cause malfunction of some implementations, sometimes to the extent of causing them to crash [Postel, 1987] [Braden, 1992]. These packets are still commonly employed for the purpose of TCP/IP stack fingerprinting. Section 12.1 contains a discussion of TCP/IP stack fingerprinting.

3.6.1. Reserved (four bits)

These four bits are reserved for future use, and must be zero. As with virtually every field, the Reserved field could be used as a covert channel. While there exist intermediate devices such as protocol scrubbers that clear these bits, and firewalls that drop/reject segments with any of these bits set, these devices should consider the impact of these policies on TCP interoperability. For example, as TCP continues to evolve, all or part of the bits in the Reserved field could be used to implement some new functionality. If some middle-box or end-system implementation were to drop a TCP segment merely because some of these bits are not set to zero, interoperability problems would arise.

Therefore, we recommend that implementations simply ignore the Reserved field.

3.6.2. CWR (Congestion Window Reduced)

The CWR flag, defined in RFC 3168 [Ramakrishnan et al, 2001], is used as part of the Explicit Congestion Notification (ECN) mechanism. For connections in any of the synchronized states, this flag indicates, when set, that the TCP sending this segment has reduced its congestion window.

An analysis of the security implications of ECN can be found in Section 9.3 of this document.

3.6.3. ECE (ECN-Echo)

The ECE flag, defined in RFC 3168 [Ramakrishnan et al, 2001], is used as part of the Explicit Congestion Notification (ECN) mechanism.
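In the layout of Figure 2, the eight control bits discussed in this section occupy byte 13 of the header, and the Reserved field occupies the low-order four bits of byte 12. The following C sketch extracts the Data Offset and the control bits without ever examining the Reserved field, as recommended in Section 3.6.1. The TH_* names follow the convention used by BSD-derived stacks but are defined locally here; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Control bits in byte 13 of the TCP header (Figure 2). */
#define TH_FIN 0x01
#define TH_SYN 0x02
#define TH_RST 0x04
#define TH_PSH 0x08
#define TH_ACK 0x10
#define TH_URG 0x20
#define TH_ECE 0x40
#define TH_CWR 0x80

/* Data Offset: high-order nibble of byte 12. Shifting right discards
   the Reserved bits (low-order nibble), so they are never examined. */
unsigned tcp_data_offset(const uint8_t *hdr)
{
    return hdr[12] >> 4;
}

/* All eight control bits, CWR (most significant) through FIN. */
uint8_t tcp_control_bits(const uint8_t *hdr)
{
    return hdr[13];
}
```

A receiver structured this way processes a segment identically whether or not any Reserved bits are set, so a future use of those bits does not cause the segment to be dropped.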
Once a TCP connection has been established, an ACK segment with the ECE bit set indicates that congestion was encountered in the network on the path from the sender to the receiver. This indication of congestion should be treated just as a congestion loss in non-ECN-capable TCP [Ramakrishnan et al, 2001]. Additionally, TCP should not increase the congestion window (cwnd) in response to such an ACK segment that indicates congestion, and should also not react to congestion indications more than once every window of data (or once per round-trip time).

An analysis of the security implications of ECN can be found in Section 9.3 of this document.

3.6.4. URG

When the URG flag is set, the Urgent Pointer field contains the current value of the urgent pointer.

Receipt of an "urgent" indication generates, in a number of implementations, a software interrupt (signal) that is delivered to the corresponding process. In UNIX-like systems, this is the SIGURG signal.

A number of applications handle TCP urgent indications by installing a signal handler for the corresponding signal (e.g., SIGURG). As discussed in [Zalewski, 2001b], some signal handlers can be maliciously exploited by an attacker, for example to gain remote access to a system. While secure programming of signal handlers is out of the scope of this document, we nevertheless raise awareness that TCP urgent indications might be exploited to abuse poorly-written signal handlers.

Section 3.9 discusses the security implications of the TCP urgent mechanism.

3.6.5. ACK

When the ACK bit is set, the Acknowledgement Number field contains the next sequence number expected, cumulatively acknowledging the receipt of all data up to the sequence number in the Acknowledgement Number, minus one. Section 3.4 of this document describes sanity checks that should be performed on the Acknowledgement Number field.

TCP Acknowledgements are also used to perform heuristics for loss recovery and congestion control. Section 9 of this document describes a number of ways in which these mechanisms can be exploited.

3.6.6. PSH

RFC 793 [Postel, 1981c] contains (on pages 54-64) a functional description of a TCP Application Programming Interface (API). One of the parameters of the SEND function is the PUSH flag which, when set, signals the local TCP that it must send all unsent data. The TCP PSH (PUSH) flag will be set in the last outgoing segment, to signal the push function to the receiving TCP. Upon receipt of a segment with the PSH flag set, the receiving user's buffer is returned to the user, without waiting for additional data to arrive.

There are two security considerations arising from the PUSH function. On the sending side, an attacker could cause a large amount of data to be queued for transmission without setting the PUSH flag in the SEND call. This would prevent the local TCP from sending the queued data, causing system memory to be tied up with those data for an unnecessarily long period of time.

An analogous consideration should be made for the receiving TCP. TCP is allowed to buffer incoming data until the receiving user's buffer fills or a segment with the PSH bit set is received.
If the receiving TCP implements this policy, an attacker could send a large amount of data, slightly less than the receiving user's buffer size, to cause system memory to be tied to these data for an unnecessarily long period of time. Both of these issues are discussed in Section 4.2.2.2 of RFC 1122 [Braden, 1989].

In order to mitigate these potential vulnerabilities, we suggest assuming an implicit "PUSH" in every SEND call. On the sending side, this means that as a result of a SEND call TCP should try to send all queued data (provided that TCP's flow control and congestion control algorithms allow it). On the receiving side, this means that the received data will be immediately delivered to an application calling the RECEIVE function, even if the data already available are less than those requested by the application.

It is interesting to note that popular TCP APIs (such as "sockets") do not provide a PUSH flag in any of the interfaces they define, but rather use heuristics to set the PSH bit in outgoing segments. As a result, the value of the PSH bit in received TCP segments is usually a policy of the sending TCP, rather than a policy of the sending application. Robust applications that make use of those APIs (such as the sockets API) properly handle the case of a RECEIVE call returning less data than requested, usually by performing subsequent RECEIVE calls.

Another potential malicious use of the PSH bit would be for an attacker to send small TCP segments (probably with zero bytes of data payload) to cause the receiving application to be unnecessarily woken up (increasing the CPU load), or to cause malfunction of poorly-written applications that do not properly handle RECEIVE calls returning less data than requested.

3.6.7. RST

The RST bit is used to request the abortion (abnormal close) of a TCP connection. RFC 793 [Postel, 1981c] suggests that an RST segment should be considered valid if its Sequence Number is valid (i.e., falls within the receive window). However, in response to the security concerns raised by [Watson, 2004] and [NISCC, 2004], [Ramaiah et al, 2008] suggests the following alternative processing rules for RST segments:

   o  If the Sequence Number of the RST segment is not valid (i.e., falls outside of the receive window), silently drop the segment.

   o  If the Sequence Number of the RST segment matches the next expected sequence number (RCV.NXT), abort the corresponding connection.

   o  If the Sequence Number is valid (i.e., falls within the receive window) but is not exactly RCV.NXT, send an ACK segment (a "challenge ACK") of the form:

         <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK>

[Ramaiah et al, 2008] suggests that implementations should rate-limit the challenge ACK segments sent as a result of implementing this mechanism.

Section 11.1 of this document describes TCP-based connection-reset attacks, along with a number of countermeasures to mitigate their impact.

3.6.8. SYN

The SYN bit is used during the connection-establishment phase, to request the synchronization of sequence numbers.

There are basically four different vulnerabilities that make use of the SYN bit: SYN-flooding attacks, connection forgery attacks, connection flooding attacks, and connection-reset attacks. They are described in Section 5.1, Section 5.2, Section 5.3, and Section 11.1.2, respectively, along with the possible countermeasures.

3.6.9. FIN

The FIN flag is used to signal to the remote end-point the end of the data transfer in this direction.
Receipt of a valid FIN segment (i.e., a TCP segment with the FIN flag set) causes a transition in the connection state, as part of what is usually referred to as the "connection-termination phase".

The connection-termination phase can be exploited to perform a number of resource-exhaustion attacks. Section 6 of this document describes a number of attacks that exploit the connection-termination phase, along with the possible countermeasures.

3.7. Window

The TCP Window field advertises how many bytes of data the remote peer is allowed to send before a new advertisement is made. Theoretically, the maximum transfer rate that can be achieved by TCP is limited to:

      Maximum Transfer Rate = Window / RTT

This means that, under ideal network conditions (e.g., no packet loss), the TCP Window in use should be at least:

      Window = 2 * Bandwidth * Delay

where Delay is the one-way delay of the path (i.e., RTT = 2 * Delay). Using a larger Window than that resulting from this equation will not provide any improvement in terms of performance.

In practice, selection of the most convenient Window size may also depend on a number of other parameters, such as the packet loss rate, the loss recovery mechanisms in use, etc.

3.7.1. Security implications of the maximum TCP window size

An aspect of the TCP Window that is usually overlooked is the security implications of its size. Increasing the TCP window increases the sequence number space that will be considered "valid" for incoming segments. Thus, unnecessarily large TCP Window sizes needlessly increase TCP's vulnerability to forgery attacks.

In those scenarios in which the network conditions are known and/or can be easily predicted, it is recommended that the TCP Window never be set to a value larger than that resulting from the equations above.
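As a rough illustration of this trade-off, the following Python sketch computes two quantities. The first is the window given by Window = 2 * Bandwidth * Delay. The second is, for a given window, the worst-case number of forged segments a blind attacker would need to be sure of landing one inside the receive window. The sketch assumes, as a simplification, that the attacker already knows the connection's addresses and ports and steps through the 32-bit sequence number space one window at a time. The function names are hypothetical, and integer arithmetic is used to avoid floating-point rounding.

```python
SEQ_SPACE = 2 ** 32  # TCP sequence numbers are 32-bit quantities


def ceil_div(a: int, b: int) -> int:
    """Integer division rounding up (a, b > 0)."""
    return -(-a // b)


def ideal_window(bandwidth_bps: int, one_way_delay_ms: int) -> int:
    """Window = 2 * Bandwidth * Delay, returned in bytes.

    bandwidth_bps is in bits per second and one_way_delay_ms in
    milliseconds, hence the division by 8 bits/byte * 1000 ms/s.
    """
    return ceil_div(2 * bandwidth_bps * one_way_delay_ms, 8 * 1000)


def blind_attempts(window: int) -> int:
    """Worst-case number of forged segments needed to hit a receive
    window of this size when guesses are spaced one window apart
    (simplified model: four-tuple already known)."""
    return ceil_div(SEQ_SPACE, window)


if __name__ == "__main__":
    w = ideal_window(10_000_000, 25)  # 10 Mbit/s path, 25 ms one-way delay
    print(w, blind_attempts(w))       # 62500 68720
    print(blind_attempts(4096))       # 1048576
```

Under this model, reducing the window from 62500 to 4096 bytes increases the attacker's worst-case effort roughly fifteen-fold.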
Additionally, the nature of the application running on top of TCP should be considered when tuning the TCP window. As an example, an H.245 signaling application certainly does not have high throughput requirements, and thus a window size of around 4 KBytes will usually fulfill its needs, while keeping TCP's resistance to off-path forgery attacks at a decent level. Some rough measurements seem to indicate that a TCP window of 4 KBytes is common practice for TCP connections servicing applications such as BGP.

In principle, a possible approach to avoid requiring administrators to manually set the TCP window would be to implement an automatic buffer tuning mechanism, such as that described in [Heffner, 2002]. However, as discussed in Section 7.3.2 of this document, these mechanisms can be exploited to perform other types of attacks.

3.7.2. Security implications arising from closed windows

The TCP window is a flow-control mechanism that prevents a fast data sender application from overwhelming a "slow" receiver. When a TCP end-point is not willing to receive any more data (before some of the data that have already been received are consumed), it will advertise a TCP window of zero bytes. This will effectively stop the sender from sending any new data to the TCP receiver. Transmission of new data will resume when the TCP receiver advertises a nonzero TCP window, usually with a TCP segment that contains no data ("an ACK").

This segment is usually referred to as a "window update", as the only purpose of this segment is to inform the sender of the new window.

To accommodate those scenarios in which the ACK segment that "opens" the window is lost, TCP implements a "persist timer" that causes the TCP sender to query the TCP receiver periodically if the last segment received advertised a window of zero bytes.
This probe simply consists of sending one byte of new data that will force the TCP receiver to send an ACK segment back to the TCP sender, containing the current TCP window. Similarly to the retransmission timer, an exponential back-off is used when calculating the persist timer, so that the spacing between probes increases exponentially.

A fundamental difference between the "persist timer" and the retransmission timer is that there is no limit on the amount of time during which a TCP can advertise a zero window. This means that a TCP end-point could potentially advertise a zero window forever, thus keeping kernel memory at the TCP sender tied to the TCP retransmission buffer. This could clearly be exploited as a vector for performing a Denial of Service (DoS) attack against TCP.

Section 7.1 of this document describes a Denial of Service attack that aims at exhausting the kernel memory used for the TCP retransmission buffer, along with possible countermeasures.

3.8. Checksum

The Checksum field is an error-detection mechanism that covers the contents of the TCP segment and a number of important fields of the IP header. It is computed over the full TCP segment (header and data), prepended with a pseudo-header that includes the IP Source Address, the IP Destination Address, the Protocol number, and the TCP segment length. While in principle there should not be security implications arising from this field, non-RFC-compliant implementations allow the Checksum to be exploited to detect firewalls, evade Network Intrusion Detection Systems (NIDS), and/or perform Denial of Service attacks.

If a stateful firewall does not check the TCP Checksum in the segments it processes, an attacker can exploit this situation to perform a variety of attacks.
For example, the attacker could send a flood of TCP segments with invalid checksums, which would nevertheless create state information at the firewall. When each of these segments is received at its intended destination, the TCP checksum will be found to be incorrect, and the corresponding segment will be silently discarded. As these segments will not elicit a response (e.g., an RST segment) from the intended recipients, the corresponding connection-state entries at the firewall will not be removed. Therefore, an attacker may end up tying all the state resources of the firewall to TCP connections that will never complete or be terminated. This would probably lead to a Denial of Service to legitimate users, or force the firewall to randomly drop connection-state entries.

If a NIDS does not check the Checksum of TCP segments, an attacker may send TCP segments with an invalid checksum to cause the NIDS to obtain a TCP data stream different from that obtained by the system being monitored. In order to "confuse" the NIDS, the attacker would send TCP segments with an invalid Checksum and a Sequence Number that would overlap the sequence number space being used for his malicious activity. FTester [Barisani, 2006] is a tool that can be used to assess NIDS on this issue.

Finally, an attacker performing port-scanning could potentially exploit intermediate systems that do not check the TCP Checksum to detect whether a given TCP port is being filtered by an intermediate firewall, or the port is actually closed by the host being port-scanned. If a given TCP port appeared to be closed, the attacker would then send a SYN segment with an invalid Checksum. If this segment elicited a response (either an ICMP error message or a TCP RST segment), then that response must have come from a system that does not check the TCP checksum.
Since normal host implementations of the TCP protocol do check the TCP checksum, such a response would most likely come from a firewall or some other middle-box.

[Ed3f, 2002] describes the exploitation of the TCP checksum for performing the above activities. [US-CERT, 2005d] provides an example of a TCP implementation that failed to check the TCP checksum.

3.9. Urgent pointer

If the URG bit is set, the Urgent Pointer field communicates the current value of the urgent pointer as a positive offset from the Sequence Number in this segment. That is, the urgent pointer is obtained as:

      urgent_pointer = Sequence Number + Urgent Pointer

where the addition is performed modulo 2^32, as is all sequence-number arithmetic.

According to RFC 1122 [Braden, 1989], the urgent pointer (urgent_pointer) points to the last byte of urgent data in the stream. However, in virtually all TCP implementations the urgent pointer has the semantics of pointing to the byte following the last byte of urgent data [Gont and Yourtchenko, 2009].

There was some ambiguity in RFC 793 [Postel, 1981c] with respect to the semantics of the urgent pointer. Section 4.2.2.4 of RFC 1122 [Braden, 1989] clarified this ambiguity, stating that the urgent pointer points to the last byte of urgent data. However, the RFC 1122 semantics for the urgent pointer never made it into actual implementations.

Ongoing work at the IETF [Gont and Yourtchenko, 2009] aims at updating the IETF specifications to change the semantics of the urgent pointer so that it points to "the byte following the last byte of urgent data", thus accommodating virtually all existing implementations of the TCP urgent mechanism.

Section 3.7 of RFC 793 [Postel, 1981c] states (on page 42) that to send an urgent indication the user must also send at least one byte of data.
Therefore, if the URG bit is set, the following check should be performed:

      Segment.Size - Data Offset * 4 > 0

where Segment.Size is the size of the TCP segment (header plus data) in bytes, so that the left-hand side is the amount of data carried by the segment. If a TCP segment with the URG bit set does not pass this check, it should be silently dropped.

It is worth noting that the resulting urgent_pointer may refer to a sequence number not present in this segment. That is, the "last byte of urgent data" might be received in subsequent segments.

If the URG bit is zero, the Urgent Pointer is not valid, and thus should not be processed by the receiving TCP. Nevertheless, we recommend that TCP implementations set the Urgent Pointer to zero when sending a TCP segment that does not have the URG bit set, and ignore the Urgent Pointer (as required by RFC 793) when the URG bit is zero.

Some stacks have been known to fail to set the Urgent Pointer to zero when the URG bit is zero, thus leaking out the corresponding system memory contents. [Zalewski, 2008] provides further details about this issue.

According to the IETF specifications, TCP's urgent mechanism simply marks an interesting point in the data stream that applications may want to skip to even before processing any other data. However, "urgent data" must still be delivered "in band" to the application.

Unfortunately, virtually all TCP implementations process TCP urgent data differently. By default, the "last byte of urgent data" is delivered to the application "out of band". That is, it is not delivered as part of the normal data stream.

For example, the "out of band" byte is read by an application when a recv(2) system call with the MSG_OOB flag set is issued.
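The receive-side URG checks and the send-side recommendation described above can be sketched in Python as follows. This is a hypothetical illustration, not code from any TCP implementation. The function names are invented for the example, and header fields are assumed to have been parsed from the segment already.

```python
from typing import Optional

SEQ_MOD = 2 ** 32  # sequence-number arithmetic wraps modulo 2**32


def urg_check_passes(urg: bool, segment_size: int, data_offset: int) -> bool:
    """Enforce Segment.Size - Data Offset * 4 > 0 whenever URG is set.

    segment_size is the TCP segment size in bytes (header plus data);
    data_offset is the Data Offset field, in 32-bit words. A False
    result means the segment should be silently dropped.
    """
    if not urg:
        return True
    return segment_size - data_offset * 4 > 0


def urgent_pointer(urg: bool, seq: int, urg_field: int) -> Optional[int]:
    """urgent_pointer = Sequence Number + Urgent Pointer (mod 2**32).

    Returns None when URG is clear, since the field must then be
    ignored. The result may lie beyond the end of this segment.
    """
    if not urg:
        return None
    return (seq + urg_field) % SEQ_MOD


def outgoing_urgent_field(urg: bool, offset: int) -> int:
    """Value for the Urgent Pointer field of an outgoing segment: zero
    whenever URG is not set, so that no stale memory is leaked."""
    return offset if urg else 0
```

The sketch only computes the pointer. The receiving TCP still decides whether it designates the last byte of urgent data (RFC 1122) or the byte following it (virtually all implementations).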
Most implementations provide a socket option (SO_OOBINLINE) that allows an application to override the default processing of urgent data, so that they are delivered "in band" to the application, thus providing the semantics intended by the IETF specifications.

Some implementations have been found to be unable to process TCP urgent indications correctly. [Myst, 1997] originally described how TCP urgent indications could be exploited to perform a Denial of Service (DoS) attack against some TCP/IP implementations, usually leading to a system crash.

The following subsections analyze the security implications of the TCP urgent mechanism. Section 3.9.1 discusses the security implications arising from the different possible semantics for the urgent pointer and for the TCP urgent indications. Section 3.9.2 discusses the security implications that may arise when systems implement the TCP urgent mechanism as "out of band" data.

3.9.1. Security implications arising from ambiguities in the processing of urgent indications

As discussed in Section 3.9, there exists some ambiguity with respect to how a receiving application may process the TCP urgent indications sent by the peer application. Firstly, the different possible semantics of the urgent pointer create ambiguity with respect to which of the bytes in the data stream are considered to be "urgent data". Secondly, some applications may process these urgent data "in band" (either because TCP urgent data are implemented as intended by the IETF specifications, or because the application sets the SO_OOBINLINE socket option), while others may process them "out of band" (e.g., as a result of a recv(2) call with the MSG_OOB flag set). Thirdly, some TCP implementations keep a buffer of a single byte for storing the "urgent byte" that is delivered "out of band" to the application.
Thus, if successive indications of urgent data are received before the application reads the pending "out of band" byte, the pending byte will be discarded (i.e., overwritten by the new byte of urgent data). Fourthly, some middle-boxes clear the URG bit and reset the Urgent Pointer field to zero before forwarding a packet, thus essentially eliminating the "urgent" indication.

[Cisco, 2008a] provides documentation of such a middle-box.

All these considerations make it difficult for Network Intrusion Detection Systems (NIDS) to monitor the application-layer data stream transferred to the screened systems, thus potentially leading to false negatives or false positives.

[Ko et al, 2001] describes some of the possible ways to exploit TCP urgent data to evade NIDS.

Considering the security implications of the TCP urgent mechanism, and given that widely-deployed middle-boxes clear the URG bit and reset the Urgent Pointer to zero (thus making the urgent indication unreliable), we discourage the use of the TCP urgent mechanism by applications.

We also recommend that those legacy applications that depend on the TCP urgent mechanism set the SO_OOBINLINE socket option, so that urgent data are delivered "in band" to the application running on top of TCP.

Packet scrubbers might consider clearing the URG bit and setting the Urgent Pointer to zero. This would eliminate the urgent indication and cause urgent data to be processed in-line, regardless of the semantics in use at the destination system for the TCP urgent indications. However, this might cause interoperability problems and/or undesired behavior that should be considered before enabling such behavior in packet scrubbers.

3.9.2. Security implications arising from the implementation of the urgent mechanism as "out of band" data

As described in the previous subsection, some implementations keep a buffer of a single byte for storing the "urgent byte" that is delivered "out of band" to the application running on top of TCP. If successive indications of urgent data are received before the application reads the pending "urgent" byte, the pending byte is discarded (i.e., overwritten by the new byte of urgent data). This makes it difficult for a NIDS to track the application-layer data transferred to the monitored system. Some of the urgent data might (or might not) end up being discarded at the destination system, depending on the timing of the arriving segments and the consumption of urgent data by the application (assuming the SO_OOBINLINE socket option has not been set).

In order to avoid urgent data being discarded, some implementations queue each of the received "urgent bytes", so that even if another urgent indication is received before the pending urgent data are consumed by the application, those bytes do not need to be discarded. Unfortunately, some of these implementations have been known to fail to enforce any limits on the amount of urgent data that they queue. As a result, an attacker could exhaust the kernel memory of such TCP implementations by sending successive TCP segments that carry urgent data.

TCP implementations that queue urgent data for "out of band" processing should enforce per-connection limits on the amount of urgent data that they queue.

3.10. Options

[IANA, 2007] contains the official list of the assigned option numbers. [Hoenes, 2007] contains an unofficial, updated version of the IANA list of assigned option numbers.
The following table contains a summary of the assigned TCP option numbers, based on [Hoenes, 2007].

+--------+----------------------+-----------------------------------+
| Kind   | Meaning              | Summary                           |
+--------+----------------------+-----------------------------------+
| 0      | End of Option List   | Discussed in Section 4.1          |
+--------+----------------------+-----------------------------------+
| 1      | No-Operation         | Discussed in Section 4.2          |
+--------+----------------------+-----------------------------------+
| 2      | Maximum Segment Size | Discussed in Section 4.3          |
+--------+----------------------+-----------------------------------+
| 3      | WSOPT - Window Scale | Discussed in Section 4.6          |
+--------+----------------------+-----------------------------------+
| 4      | SACK Permitted       | Discussed in Section 4.4.1        |
+--------+----------------------+-----------------------------------+
| 5      | SACK                 | Discussed in Section 4.4.2        |
+--------+----------------------+-----------------------------------+
| 6      | Echo (obsoleted by   | Obsolete. Specified in RFC 1072   |
|        | option 8)            | [Jacobson and Braden, 1988]       |
+--------+----------------------+-----------------------------------+
| 7      | Echo Reply           | Obsolete. Specified in RFC 1072   |
|        | (obsoleted by option | [Jacobson and Braden, 1988]       |
|        | 8)                   |                                   |
+--------+----------------------+-----------------------------------+
| 8      | TSOPT - Time Stamp   | Discussed in Section 4.7          |
|        | Option               |                                   |
+--------+----------------------+-----------------------------------+
| 9      | Partial Order        | Historic. Specified in RFC 1693   |
|        | Connection Permitted | [Connolly et al, 1994]            |
+--------+----------------------+-----------------------------------+
| 10     | Partial Order        | Historic. Specified in RFC 1693   |
|        | Service Profile      | [Connolly et al, 1994]            |
+--------+----------------------+-----------------------------------+
| 11     | CC                   | Historic. Specified in RFC 1644   |
|        |                      | [Braden, 1994]                    |
+--------+----------------------+-----------------------------------+
| 12     | CC.NEW               | Historic. Specified in RFC 1644   |
|        |                      | [Braden, 1994]                    |
+--------+----------------------+-----------------------------------+
| 13     | CC.ECHO              | Historic. Specified in RFC 1644   |
|        |                      | [Braden, 1994]                    |
+--------+----------------------+-----------------------------------+
| 14     | TCP Alternate        | Historic. Specified in RFC 1146   |
|        | Checksum Request     | [Zweig and Partridge, 1990]       |
+--------+----------------------+-----------------------------------+
| 15     | TCP Alternate        | Historic. Specified in RFC 1146   |
|        | Checksum Data        | [Zweig and Partridge, 1990]       |
+--------+----------------------+-----------------------------------+
| 16     | Skeeter              | Historic                          |
+--------+----------------------+-----------------------------------+
| 17     | Bubba                | Historic                          |
+--------+----------------------+-----------------------------------+
| 18     | Trailer Checksum     | Historic                          |
|        | Option               |                                   |
+--------+----------------------+-----------------------------------+
| 19     | MD5 Signature Option | Discussed in Section 4.5          |
+--------+----------------------+-----------------------------------+
| 20     | SCPS Capabilities    | Specified in [CCSDS, 2006]        |
+--------+----------------------+-----------------------------------+
| 21     | Selective Negative   | Specified in [CCSDS, 2006]        |
|        | Acknowledgements     |                                   |
+--------+----------------------+-----------------------------------+
| 22     | Record Boundaries    | Specified in [CCSDS, 2006]        |
+--------+----------------------+-----------------------------------+
| 23     | Corruption           | Specified in [CCSDS, 2006]        |
|        | experienced          |                                   |
+--------+----------------------+-----------------------------------+
| 24     | SNAP                 | Historic                          |
+--------+----------------------+-----------------------------------+
| 25     | Unassigned (released | Unassigned                        |
|        | 2000-12-18)          |                                   |
+--------+----------------------+-----------------------------------+
| 26     | TCP Compression      | Historic                          |
|        | Filter               |                                   |
+--------+----------------------+-----------------------------------+
| 27     | Quick-Start Response | Specified in RFC 4782 [Floyd et   |
|        |                      | al, 2007]                         |
+--------+----------------------+-----------------------------------+
| 28-252 | Unassigned           | Unassigned                        |
+--------+----------------------+-----------------------------------+
| 253    | RFC3692-style        | Described by RFC 4727 [Fenner,    |
|        | Experiment 1         | 2006]                             |
+--------+----------------------+-----------------------------------+
| 254    | RFC3692-style        | Described by RFC 4727 [Fenner,    |
|        | Experiment 2         | 2006]                             |
+--------+----------------------+-----------------------------------+

Table 3: TCP Options

There are two cases for the format of a TCP option:

   o  Case 1: A single byte of option-kind.

   o  Case 2: An option-kind byte, followed by an option-length byte, and the actual option-data bytes.

In options of "Case 2", the option-length byte counts the option-kind byte and the option-length byte, as well as the actual option-data bytes.

All options except "End of Option List" (Kind = 0) and "No Operation" (Kind = 1) are of "Case 2".

There are a number of sanity checks that should be performed on TCP options before further option processing is done. These sanity checks help prevent a number of potential security problems, including buffer overflows.
When these checks fail, the segment carrying the option should be silently dropped.

For options that belong to "Case 2" described above, the following check should be performed:

      option-length >= 2

The value "2" accounts for the option-kind byte and the option-length byte, and assumes zero bytes of option-data.

This check prevents, among other things, loops in option processing that may arise from incorrect option lengths (e.g., an option-length of 0 would make a parser examine the same option forever).

Additionally, while the option-length byte of TCP options of "Case 2" allows for an option length of up to 255 bytes, there is a limit on legitimate option length imposed by the syntax of the TCP header. Therefore, for all options of "Case 2", the following check should be enforced:

      option-offset + option-length <= Data Offset * 4

where option-offset is the offset of the first byte of the option within the TCP header, with the first byte of the TCP header being assigned an offset of 0.

If a TCP segment does not pass this check, it should be silently dropped.

The aforementioned check is meant to detect forged option-length values that might make an option overlap with the TCP payload, or even go past the actual end of the TCP segment carrying the option.

Section 3.1 of RFC 793 [Postel, 1981c] states that TCP must implement all the TCP options defined in that document. Additionally, a TCP implementation may support TCP extensions based on other TCP options as it sees fit, or as required by other specifications.

TCP Options have been specified in the past both within the IETF and by other groups.

TCP must ignore unknown TCP options, provided they pass the validation checks described earlier in this Section.
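The option-validation checks described above can be combined into a small option parser. The following Python sketch is illustrative only. parse_tcp_options() is a hypothetical name; it expects the raw TCP header as bytes and returns None to signal that the segment should be silently dropped. Treating a "Case 2" option whose option-length byte would fall outside the header as invalid is an assumption of this sketch. Unknown option kinds that pass the checks are simply returned, so that the caller can ignore them (and never echo them back).

```python
from typing import List, Optional, Tuple

EOL = 0  # End of Option List ("Case 1")
NOP = 1  # No-Operation ("Case 1")


def parse_tcp_options(header: bytes) -> Optional[List[Tuple[int, bytes]]]:
    """Validate and parse the options of a raw TCP header.

    header must hold at least Data Offset * 4 bytes (payload may follow).
    Returns (kind, option-data) pairs for "Case 2" options, or None if
    the segment should be silently dropped.
    """
    if len(header) < 20:
        return None
    hdr_len = (header[12] >> 4) * 4  # Data Offset * 4
    if hdr_len < 20 or hdr_len > len(header):
        return None
    options: List[Tuple[int, bytes]] = []
    offset = 20  # option-offset of the first option
    while offset < hdr_len:
        kind = header[offset]
        if kind == EOL:
            break  # anything after End of Option List is padding
        if kind == NOP:
            offset += 1
            continue
        # "Case 2" option: the option-length byte must itself lie
        # inside the header (assumption of this sketch).
        if offset + 1 >= hdr_len:
            return None
        length = header[offset + 1]
        if length < 2:
            return None  # option-length >= 2 (prevents endless loops)
        if offset + length > hdr_len:
            return None  # option-offset + option-length <= Data Offset * 4
        options.append((kind, header[offset + 2:offset + length]))
        offset += length
    return options
```

For example, an MSS option followed by NOP padding and an End of Option List parses to [(2, b"\x05\xb4")], while an option with an option-length of 0 causes the segment to be dropped.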
Likewise, middle-boxes such as packet filters should not reject TCP segments containing "unknown" TCP options that pass the validation checks described earlier in this Section.

There is renewed interest in defining new TCP options for purposes like improved connection management and maintenance, advanced congestion control schemes, and security features. The evolution of the TCP/IP protocol suite would be severely impacted by obstacles to deploying such new protocol mechanisms.

In the past, TCP enhancements based on TCP options have regularly specified the exchange of a specific "enabling" option during the initial SYN/SYN-ACK handshake. Because the severely limited TCP option space has already become a concern, it should be expected that future specifications might introduce new options that are not negotiated or enabled in this way. Therefore, middle-boxes such as packet filters should not reject TCP segments containing unknown options solely because these options were not present in the SYN/SYN-ACK handshake.

The specification of particular TCP options may contain specific rules for the syntax and placement of these options. These can only be enforced by end systems implementing these options, and the relevant specifications must point out the necessary details and related security considerations, which must be followed by implementers.

Some TCP implementations have been known to "echo" unknown TCP options received in incoming segments. Here we stress that TCP must not "echo" in any way unknown TCP options received in inbound TCP segments.

This is fundamental to the introduction of new TCP options, as it ensures unambiguous behavior of systems that do not support a new specification.

Section 4 of this document analyzes the security implications of common TCP options.

3.11. Padding

The TCP header padding is used to ensure that the TCP header ends and data begins on a 32-bit boundary. The padding is composed of zeros.

3.12. Data

The data field contains the upper-layer packet being transmitted by means of TCP. This payload is processed by the application process making use of the transport services of TCP. Therefore, the security implications of this field are out of the scope of this document.

4. Common TCP Options

4.1. End of Option List (Kind = 0)

This option is used to indicate the "end of options" in those cases in which the end of options would not coincide with the end of the TCP header.

TCP implementations are required to ignore those options they do not implement, and to be able to handle options with illegal lengths. Therefore, TCP implementations should be able to gracefully handle those TCP segments in which the End of Option List should have been present, but is missing.

It is interesting to note that some TCP implementations do not use the "End of Option List" option for indicating the "end of options", but simply pad the TCP header with several "No Operation" (Kind = 1) options to meet the header length specified by the Data Offset header field.

4.2. No Operation (Kind = 1)

The No-Operation option is basically used to allow the sending system to align subsequent options on, for example, 32-bit boundaries.

This option does not have any known security implications.

4.3. Maximum Segment Size (Kind = 2)

The Maximum Segment Size (MSS) option is used to indicate to the remote TCP endpoint the maximum segment size this TCP is willing to receive.

The advertised maximum segment size may be the result of the consideration of a number of factors.
Firstly, if fragmentation is employed, the size of the IP reassembly buffer may impose a limit on the maximum TCP segment size that can be received. Considering that the minimum IP reassembly buffer size is 576 bytes, if an MSS option is not included in the connection-establishment phase, an MSS of 536 bytes should be assumed. Secondly, if Path-MTU Discovery (specified in RFC 1191 [Mogul and Deering, 1990] and RFC 1981 [McCann et al, 1996]) is expected to be used for the connection, a TCP may enforce an artificial maximum segment size. This prevents the remote peer from sending TCP segments that would be too large to be transmitted without fragmentation. Finally, a system connected by a low-speed link may choose to introduce an artificial maximum segment size to enforce an upper limit on the network latency that would otherwise negatively affect its interactive applications [Stevens, 1994].

The option begins with an option-kind byte which must be equal to 2. It is followed by an option-length byte which must be equal to 4, and a two-byte field that holds the actual "maximum segment size".

As stated in Section 3.1 of RFC 793 [Postel, 1981c], this option can only be sent in the initial connection request (i.e., in segments with the SYN control bit set). Therefore, the following check should be enforced on a TCP segment that carries this option:

      SYN == 1

If the segment does not pass this check, it should be silently dropped.

Given the option syntax, the option length must be equal to 4. Therefore, the following check should be performed:

      option-length == 4

If the check fails, the TCP segment should be silently dropped.

The TCP specifications do not impose any requirements on the maximum segment size value that is included in the MSS option.
However, there are a number of values that may cause undesirable results. Firstly, an MSS of 0 could possibly "freeze" the TCP connection, as it would not allow data to be included in the payload of the TCP segments. Secondly, low values other than 0 would degrade the performance of the TCP connection (wasting more bandwidth in protocol headers than in actual data). They could also potentially exhaust processing cycles at the sending TCP and/or the receiving TCP by increasing the interrupt rate caused by the transmitted (or received) packets.

The problems that might arise from low MSS values were first described by [Reed, 2001]. However, the community did not reach consensus on how to deal with these issues at that point.

RFC 791 [Postel, 1981a] requires IP implementations to be able to receive IP datagrams of at least 576 bytes. Assuming an IPv4 header of 20 bytes and a TCP header of 20 bytes, there should be room in each IP packet for 536 application data bytes. Therefore, the received MSS could be sanitized as follows:

      Sanitized_MSS = max(MSS, 536)

This "sanitized" MSS value would then be used to compute the "effective send MSS" by the expression included in Section 4.2.2.6 of RFC 1122 [Braden, 1989], as follows:

      Eff.snd.MSS = min(Sanitized_MSS+20, MMS_S) - TCPhdrsize - IPoptionsize

where:

   Sanitized_MSS:
      sanitized MSS value (the value received in the MSS option, with an enforced minimum value)

   MMS_S:
      maximum size for a transport-layer message that TCP may send

   TCPhdrsize:
      size of the TCP header, which is typically 20, but may be larger if TCP options are to be sent.

   IPoptionsize:
      size of any IP options that TCP will pass to the IP layer with the current message.
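The MSS checks and the sanitization step described above can be sketched in C. This is a minimal illustration under stated assumptions, not code from any particular TCP implementation. The function names are hypothetical, option parsing is assumed to have happened elsewhere, and the caller is assumed to supply MMS_S, TCPhdrsize and IPoptionsize.

```c
#include <assert.h>
#include <stdint.h>

/* 576-byte minimum reassembly buffer - 20-byte IPv4 header - 20-byte TCP header */
#define MIN_SANITIZED_MSS 536u

/* Hypothetical helper: returns 1 if an MSS option passes the checks of this
 * section (segment has SYN set and option-length == 4); 0 means the segment
 * should be silently dropped. */
static int mss_option_ok(int syn, uint8_t option_length)
{
    return syn == 1 && option_length == 4;
}

/* Sanitized_MSS = max(MSS, 536) */
static uint32_t sanitize_mss(uint16_t mss)
{
    return mss < MIN_SANITIZED_MSS ? MIN_SANITIZED_MSS : (uint32_t)mss;
}

/* Eff.snd.MSS = min(Sanitized_MSS + 20, MMS_S) - TCPhdrsize - IPoptionsize
 * (the RFC 1122 expression, fed with the sanitized MSS). */
static uint32_t eff_snd_mss(uint16_t mss, uint32_t mms_s,
                            uint32_t tcphdrsize, uint32_t ipoptionsize)
{
    uint32_t limit = sanitize_mss(mss) + 20u;
    uint32_t m = limit < mms_s ? limit : mms_s;

    /* Guard against unsigned underflow with nonsensical inputs. */
    assert(m > tcphdrsize + ipoptionsize);
    return m - tcphdrsize - ipoptionsize;
}
```

Consider MMS_S = 1480 (a 1500-byte MTU minus a 20-byte IPv4 header) and a 20-byte TCP header. A received MSS of 1460 gives an effective send MSS of 1460. A received MSS of 0 is raised to 536 instead of stalling the connection.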
There are two cases to analyze when considering the possible interoperability impact of sanitizing the received MSS value: TCP connections relying on IP fragmentation and TCP connections implementing Path-MTU Discovery. In case the corresponding TCP connection relies on IP fragmentation, given that the minimum reassembly buffer size is required to be 576 bytes by RFC 791 [Postel, 1981a], the adoption of 536 bytes as a lower limit is safe.

In case the TCP connection relies on Path-MTU Discovery, imposing a lower limit on the adopted MSS may ignore the advice of the remote TCP on the maximum segment size that can possibly be transmitted without fragmentation. As a result, the first TCP data segment could be larger than the Path-MTU. However, in such a scenario, the TCP segment should elicit an ICMP Unreachable "fragmentation needed and DF bit set" error message. That error message would cause the "effective send MSS" (E_MSS) to be decreased appropriately. Thus, imposing a lower limit on the accepted MSS will not cause any interoperability problems.

A possible scenario exists in which the proposed enforcement of a lower limit in the received MSS might lead to an interoperability problem. Consider a system attached to the network by means of a link with an MTU of less than 576 bytes. Suppose also that some intermediate system either silently dropped (i.e., without sending an ICMP error message) those packets equal to or larger than 576 bytes, or simply filtered ICMP "fragmentation needed and DF bit set" error messages. In that case, the proposed behavior would lead to an interoperability problem, when communication could have otherwise succeeded.
However, the interoperability problem would really be introduced by the network setup (e.g., the middle-box silently dropping packets), rather than by the mechanism proposed in this section. In any case, TCP should nevertheless implement a mechanism such as that specified by RFC 4821 [Mathis and Heffner, 2007] to deal with this type of "network black-holes".

4.4. Selective Acknowledgement Option

The Selective Acknowledgement (SACK) option provides an extension that allows the acknowledgement of individual segments, to enhance TCP's loss recovery.

Two options are involved in the SACK mechanism. The "SACK-permitted option" is sent during the connection-establishment phase, to advertise that SACK is supported. If both TCP peers agree to use selective acknowledgements, the actual selective acknowledgements are sent, if needed, by means of "SACK options".

4.4.1. SACK-permitted Option (Kind = 4)

The SACK-permitted option is meant to advertise that the TCP sending this segment supports Selective Acknowledgements. The SACK-permitted option can be sent only in SYN segments. Therefore, the following check should be performed on TCP segments that contain this option:

      SYN == 1

If a segment does not pass this check, it should be silently dropped.

The SACK-permitted option is composed of an option-kind octet (which must be 4) and an option-length octet (which must be 2). Therefore, the following check should be performed on the option:

      option-length == 2

If the option does not pass this check, the TCP segment carrying the option should be silently dropped.

4.4.2. SACK Option (Kind = 5)

The SACK option is used to convey extended acknowledgment information from the receiver to the sender over an established TCP connection.
The option consists of an option-kind byte (which must be 5), an option-length byte, and a variable number of SACK blocks. Given that the space in the TCP header is limited, the following check should be enforced on the option field:

      option-offset + option-length <= Data Offset * 4

If the option does not pass this check, the TCP segment carrying the option should be silently dropped.

A SACK option with zero SACK blocks is nonsensical. Therefore, the following check should be performed:

      option-length >= 10

The value "10" accounts for the option-kind byte, the option-length byte, a 4-byte left-edge field, and a 4-byte right-edge field.

Furthermore, as stated in Section 3 of RFC 2018 [Mathis et al, 1996], a SACK option that specifies n blocks will have a length of 8*n+2. Therefore, the following check should be performed:

      (option-length - 2) % 8 == 0

If the option-length field does not pass this check, the TCP segment carrying the option should be silently dropped.

Each block included in a SACK option represents a number of received data bytes that are contiguous and isolated. That is, the bytes just below the block, (Left Edge of Block - 1), and just above the block, (Right Edge of Block), have not yet been received.

The Right Edge of Block is the sequence number immediately following the last byte covered by the block. Therefore, a block that reports any data must have its right edge above its left edge, and the following check should be enforced for each block included in the option-data:

      Left Edge of Block < Right Edge of Block

As in all the other occurrences in this document, all comparisons between sequence numbers should be performed using sequence number arithmetic.

If any block contained in the option does not pass this check, the TCP segment should be silently dropped.

Potential for resource-exhaustion attacks

The TCP receiving a SACK option is expected to keep track of the selectively-acknowledged blocks.
Space in the TCP header is limited, and thus each TCP segment can selectively acknowledge at most four blocks of data. Even so, an attacker could try to perform a buffer overflow or a resource-exhaustion attack by sending a large number of SACK options.

For example, an attacker could send a large number of SACK options, each of them acknowledging one byte of data. Additionally, to waste resources on the attacked system, each of these blocks would be separated from the next by one byte. This would prevent the attacked system from coalescing two (or more) contiguous SACK blocks into a single SACK block. If the attacked system kept track of each SACKed block by storing both the Left Edge and the Right Edge of the block, then for each window of data, the attacker could waste up to 4 * Window bytes of memory at the attacked TCP.

The value "4 * Window" results from the expression "(Window / 2) * 8". The value "2" accounts for the 1-byte block selectively acknowledged by each SACK block plus the 1 byte that separates each SACK block from the next. The value "8" accounts for the 8 bytes needed to store the Left Edge and the Right Edge of each SACKed block.

Therefore, a limit should be imposed on the number of SACK blocks that a TCP will store in memory for each connection at any time. Measurements in [Dharmapurikar and Paxson, 2005] indicate that in the vast majority of cases, connections have a single hole in the data stream at any given time. Thus, a limit of 16 SACK blocks for each connection would handle even most of the more unusual cases in which there is more than one simultaneous hole.

4.5. MD5 Option (Kind=19)

The TCP MD5 option provides a mechanism for authenticating TCP segments with a 16-byte digest produced by the MD5 algorithm.
The option consists of an option-kind byte (which must be 19), an option-length byte (which must be 18), and a 16-byte MD5 digest.

As with all TCP options of "Case 2", the following check should be enforced on the option-length field:

      option-offset + option-length <= Data Offset * 4

If the option does not pass this check, the TCP segment carrying the option should be silently dropped.

Given that the MD5 option has a fixed length, the following check should be performed on the MD5 option:

      option-length == 18

If the option does not pass this check, the TCP segment containing the option should be silently dropped.

A basic weakness of the TCP MD5 option is that the MD5 algorithm itself has long been known to be vulnerable to collision search attacks.

[Bellovin, 2006] argues that the option has two other weaknesses: it does not provide a key identifier, and it has no provision for automated key management. However, it is generally accepted that a Key-ID field, while a good approach for providing smooth key rollover, is not actually a requirement. For instance, most systems implementing the TCP MD5 option include a "keychain" mechanism that fully supports smooth key rollover. Additionally, with some further work, ISAKMP/IKE could be used to configure the MD5 keys.

There are a number of ongoing efforts within the IETF to address the weaknesses of the basic TCP MD5 option. Some of them aim at completely replacing the TCP MD5 option, while others aim at improving the current option by, for example, standardizing mechanisms for re-keying.
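The two structural checks given earlier in this Section for the MD5 option can be sketched in C. The first is placement within the header ("Case 2"); the second is option-length == 18. This is a minimal sketch under stated assumptions. It assumes "option-offset" is the byte offset of the option-kind byte from the start of the TCP header. The function name is hypothetical, and digest verification itself is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TCPOPT_MD5   19u
#define TCPOLEN_MD5  18u   /* kind (1) + length (1) + 16-byte digest */

/* Hypothetical helper: checks an MD5 option whose kind byte sits at byte
 * offset `off` from the start of the TCP header `hdr` (`hdr_len` bytes
 * available). Returns 1 if the option passes, 0 if the segment should be
 * silently dropped. */
static int md5_option_ok(const uint8_t *hdr, size_t hdr_len, size_t off)
{
    assert(hdr != NULL);
    if (hdr_len < 20)
        return 0;                      /* not even a basic TCP header */

    /* Data Offset is the upper 4 bits of byte 12, in 32-bit words. */
    size_t data_offset_bytes = (size_t)(hdr[12] >> 4) * 4u;
    if (data_offset_bytes > hdr_len || off + 2 > data_offset_bytes)
        return 0;                      /* kind and length bytes must lie in the header */
    if (hdr[off] != TCPOPT_MD5)
        return 0;

    uint8_t option_length = hdr[off + 1];

    /* option-offset + option-length <= Data Offset * 4 */
    if (off + option_length > data_offset_bytes)
        return 0;

    /* option-length == 18 */
    return option_length == TCPOLEN_MD5;
}
```

Consider a 40-byte header (Data Offset = 10). An MD5 option starting at byte 20 ends at byte 38 and passes. The same option starting at byte 24 would extend to byte 42 and causes the segment to be dropped.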
It is interesting to note that while the TCP MD5 option, as specified by RFC 2385 [Heffernan, 1998], addresses the TCP-based forgery attacks against TCP discussed in Section 11, it does not address the ICMP-based connection-reset attacks discussed in Section 15. As a result, while a TCP connection may be protected from TCP-based forgery attacks by means of the MD5 option, an attacker might still be able to successfully perform the ICMP-based counterpart.

4.6. Window scale option (Kind = 3)

The window scale option provides a mechanism to expand the definition of the TCP window to 32 bits, such that the performance of TCP can be improved in some network scenarios.

[Welzl, 2008] describes major problems with the use of the Window scale option in the Internet due to faulty equipment.

The Window scale option consists of an option-kind byte (which must be 3), followed by an option-length byte (which must be 3), and a shift count (shift.cnt) byte (the actual option-data).

The option may be sent in the initial SYN segment. It may also be sent in a SYN/ACK segment, but only if the option was received in the corresponding SYN segment. If the option is received in any other segment, the option should be silently ignored.

As discussed above, the option-length must be 3. Therefore, the following check should be enforced:

      option-length == 3

If the option does not pass this check, the TCP segment carrying this option should be silently dropped.

As discussed in Section 2.3 of RFC 1323 [Jacobson et al, 1992], new data must not be mistakenly considered as old, and vice versa. To ensure this, the resulting window must be smaller than 2^30. Therefore, an upper limit should be enforced on the shift count (shift.cnt):

      shift.cnt <= 14

If the option does not pass this check, a shift count of 14 should be used instead of the received value.
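The receive-side handling of the Window scale option described above can be sketched in C. This is a minimal sketch; the function names are hypothetical. It models only the receiver: the rule that a SYN/ACK may carry the option only if the SYN did is a sending-side rule and is not shown. The key point is that clamping shift.cnt to 14 keeps the largest scaled window below 2^30.

```c
#include <assert.h>
#include <stdint.h>

#define WS_OPTION_LENGTH 3u
#define WS_MAX_SHIFT     14u

/* Hypothetical helper: 1 if a received Window scale option should be
 * processed, 0 if it should be ignored (it is only meaningful with SYN set). */
static int ws_option_applies(int syn)
{
    return syn == 1;
}

/* option-length == 3; 0 means the segment should be silently dropped. */
static int ws_length_ok(uint8_t option_length)
{
    return option_length == WS_OPTION_LENGTH;
}

/* shift.cnt <= 14: larger values are replaced by 14. */
static unsigned ws_shift(uint8_t shift_cnt)
{
    return shift_cnt > WS_MAX_SHIFT ? WS_MAX_SHIFT : shift_cnt;
}

/* Scaled window. With the clamp, the largest possible value is
 * 65535 << 14 = 1073725440, which is below 2^30 = 1073741824. */
static uint32_t ws_scaled_window(uint16_t window, uint8_t shift_cnt)
{
    uint32_t w = (uint32_t)window << ws_shift(shift_cnt);
    assert(w < (1u << 30));
    return w;
}
```

Replacing an out-of-range shift count with 14, rather than dropping the segment, preserves the connection while still bounding the window.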
While there are no known security implications arising from the window scale mechanism itself, the size of the TCP window has a number of security implications. In general, larger window sizes increase the chances of an attacker successfully performing forgery attacks against TCP, such as those described in Section 11 of this document. Additionally, large windows can exacerbate the impact of resource exhaustion attacks such as those described in Section 7 of this document.

Section 3.7 provides a general discussion of the security implications of the TCP window size. Section 7.3.2 discusses the security implications of automatic receive-buffer tuning mechanisms.

4.7. Timestamps option (Kind = 8)

The Timestamps option, specified in RFC 1323 [Jacobson et al, 1992], is used to perform two functions: Round-Trip Time Measurement (RTTM), and Protection Against Wrapped Sequence Numbers (PAWS). As defined by RFC 1323, the option-length must be 10. Therefore, the following check should be enforced:

      option-length == 10

If the option does not pass this check, the TCP segment carrying the option should be silently dropped.

4.7.1. Generation of timestamps

For the purpose of PAWS, the timestamps sent on a connection are required to be monotonically increasing. There is no requirement that timestamps be monotonically increasing across TCP connections. However, generating timestamps that are monotonically increasing across connections between the same two endpoints has a benefit. It allows timestamps to be used to improve the handling of SYN segments that are received while the corresponding four-tuple is in the TIME-WAIT state. This is discussed in Section 11.1.2 of this document.
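A generator with these properties can be sketched in C. Timestamps keep increasing across incarnations of the same four-tuple, while different four-tuples get unrelated starting points. This is a minimal sketch under loudly stated assumptions. The `conn_id` structure and all function names are hypothetical. The clock value T() is passed in as a parameter so the example stays deterministic. The keyed FNV-1a hash is only a PLACEHOLDER for the per-connection offset: it is not cryptographically strong, and a real implementation would use a keyed cryptographic hash instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical connection identifier (IPv4 only, for brevity). */
struct conn_id {
    uint32_t local_addr, remote_addr;
    uint16_t local_port, remote_port;
};

/* 32-bit FNV-1a over a byte buffer.
 * PLACEHOLDER ONLY: FNV-1a is not a cryptographic hash. */
static uint32_t fnv1a(uint32_t h, const uint8_t *p, size_t n)
{
    while (n--) {
        h ^= *p++;
        h *= 16777619u;
    }
    return h;
}

/* Per-connection offset keyed by a secret: identical for every incarnation
 * of the same four-tuple, unrelated across different four-tuples. */
static uint32_t ts_offset(const struct conn_id *c, const uint8_t secret[16])
{
    uint8_t buf[12];

    assert(c != NULL && secret != NULL);
    /* Serialize field by field so struct padding never reaches the hash. */
    for (int i = 0; i < 4; i++) {
        buf[i]     = (uint8_t)(c->local_addr  >> (24 - 8 * i));
        buf[4 + i] = (uint8_t)(c->remote_addr >> (24 - 8 * i));
    }
    buf[8]  = (uint8_t)(c->local_port >> 8);
    buf[9]  = (uint8_t)(c->local_port);
    buf[10] = (uint8_t)(c->remote_port >> 8);
    buf[11] = (uint8_t)(c->remote_port);

    uint32_t h = fnv1a(2166136261u, secret, 16);   /* mix in the secret first */
    return fnv1a(h, buf, sizeof buf);
}

/* Timestamp = clock value + per-connection offset. `t_now` is the current
 * value of the global clock. The sum wraps modulo 2^32, as TCP timestamps do. */
static uint32_t ts_generate(uint32_t t_now, const struct conn_id *c,
                            const uint8_t secret[16])
{
    return t_now + ts_offset(c, secret);
}
```

Because the offset depends only on the four-tuple and the secret, two timestamps generated for the same four-tuple differ exactly by the elapsed clock ticks. A peer on a different four-tuple learns nothing about this clock value.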
We therefore recommend that timestamps be generated with an algorithm similar to that introduced by RFC 1948 [Bellovin, 1996] for the generation of Initial Sequence Numbers (ISNs). That is:

      timestamp = T() + F(localhost, localport, remotehost, remoteport, secret_key)

where the result of T() is a global system clock that complies with the requirements of Section 4.2.2 of RFC 1323 [Jacobson et al, 1992], and F() is a function that should not be computable from the outside. Therefore, we suggest that F() be a cryptographic hash function of the connection-id and some secret data.

F() provides an offset that will be the same for all incarnations of a connection between the same two endpoints, while T() provides the monotonically increasing values that are needed for PAWS.

[Gont, 2008c] is CPNI's effort at the IETF to document this recommended scheme for generating TCP timestamps.

4.7.2. Vulnerabilities

Blind In-Window Attacks

Segments that contain a timestamp option smaller than the last timestamp option recorded by TCP are silently dropped. This allows for a subtle attack in which an attacker causes one direction of data transfer of the attacked connection to freeze [US-CERT, 2005c]. An attacker could forge a TCP segment that contains a timestamp that is much larger than the last timestamp recorded for that direction of the data transfer of the connection. The offending segment would cause the recorded timestamp (TS.Recent) to be updated. As a result, subsequent segments sent by the impersonated TCP peer would simply be dropped by the receiving TCP. This vulnerability has been documented in [US-CERT, 2005d].
However, it is worth noting that exploitation of this vulnerability requires an attacker to guess (or know) the four-tuple {IP Source Address, IP Destination Address, TCP Source Port, TCP Destination Port}, as well as a valid Sequence Number and a valid Acknowledgement Number. If an attacker has such detailed knowledge about a TCP connection, unless TCP segments are protected by proper authentication mechanisms (such as IPsec [Kent and Seo, 2005]), he can perform a variety of attacks against the TCP connection, even more devastating than the one just described.

Information leaking

Some implementations are known to maintain a global timestamp clock, which is used for all connections. This is undesirable, as an attacker that can establish a connection with a host would learn the timestamp used for all the other connections maintained by that host. That information could be useful for performing any attacks that require the attacker to forge TCP segments. A timestamp generator such as the one recommended in Section 4.7.1 of this document would prevent this information leakage, as it separates the "timestamps space" among the different TCP connections.

Some implementations are known to initialize their global timestamp clock to zero when the system is bootstrapped. This is undesirable, as the timestamp clock would disclose the system uptime. A timestamp generator such as the one recommended in Section 4.7.1 of this document would prevent this information leakage, as the function F() introduces an "offset" that does not disclose the system uptime.

As discussed in Section 3.2 of RFC 1323 [Jacobson et al, 1992], the Timestamp Echo Reply field (TSecr) is only valid if the ACK bit of the TCP header is set, and its value must be zero when it is not valid.
However, some TCP implementations have been found to fail to set the Timestamp Echo Reply field (TSecr) to zero in TCP segments that do not have the ACK bit set, thus potentially leaking information. We stress that TCP implementations should comply with RFC 1323 by setting the Timestamp Echo Reply field (TSecr) to zero in those TCP segments that do not have the ACK bit set, thus eliminating this potential information leakage.

Finally, it should be noted that the Timestamps option can be exploited to count the number of systems behind NATs (Network Address Translators) [Srisuresh and Egevang, 2001]. An attacker could do this by establishing a number of TCP connections (using the public address of the NAT) and identifying the number of different timestamp sequences. This information leakage could be eliminated by rewriting the contents of the Timestamps option at the NAT. [Gont and Srisuresh, 2008] provides a detailed discussion of the security implications of NATs, and proposes mitigations for this and other issues.

5. Connection-establishment mechanism

The following subsections describe a number of attacks that can be performed against TCP by exploiting its connection-establishment mechanism.

5.1. SYN flood

TCP uses a mechanism known as the "three-way handshake" for the establishment of a connection between two TCP peers. RFC 793 [Postel, 1981c] describes what must happen when a TCP that is in the LISTEN state receives a SYN segment (i.e., a TCP segment with the SYN flag set). The TCP must transition to the SYN-RECEIVED state, record the control information (e.g., the ISN) contained in the SYN segment in a Transmission Control Block (TCB), and respond with a SYN/ACK segment.
A Transmission Control Block is the data structure used to store (usually within the kernel) all the information relevant to a TCP connection. The concept of "TCB" is introduced in the core TCP specification RFC 793 [Postel, 1981c].

In practice, virtually all existing implementations do not modify the state of the TCP that was in the LISTEN state. Instead, they create a new TCP (i.e., a new "protocol machine") and perform all the state transitions on this newly-created TCP. This allows the application running on top of TCP to serve more than one client at the same time. As a result, each connection request results in the allocation of system memory to store the TCB associated with the newly-created TCP.

If TCP were implemented strictly as described in RFC 793, the application running on top of TCP would have to finish servicing the current client before being able to service the next one in line. Alternatively, it would need to be able to perform some kind of connection hand-off.

An attacker could exploit TCP's connection-establishment mechanism to perform a Denial of Service (DoS) attack. The attacker sends a large number of connection requests to the target system, with the intent of exhausting the system memory destined for storing TCBs (or related kernel data structures). This prevents the attacked system from establishing new connections with legitimate users. This attack is widely known as "SYN flood", and received a lot of attention during the late 1990s [CERT, 1996].

The attacker does not need to complete the three-way handshake for the attacked system to tie system resources to the newly-created TCBs. He will therefore typically forge the source IP address of the malicious SYN segments he sends, thus concealing his own IP address.
If the forged IP addresses corresponded to some reachable system, the impersonated system would receive the SYN/ACK segment sent by the attacked host (in response to the forged SYN segment), which would elicit an RST segment. This RST segment would be delivered to the attacked system, causing the corresponding connection to be aborted, and the corresponding TCB to be removed.

As the impersonated host would not have any state information for the TCP connection being referred to by the SYN/ACK segment, it would respond with an RST segment, as specified by the TCP segment processing rules of RFC 793 [Postel, 1981c].

However, if the forged IP source addresses were unreachable, the attacked TCP would continue retransmitting the SYN/ACK segment corresponding to each connection request, until timing out and aborting the connection. For this reason, a number of widely available attack tools first check whether each of the (forged) IP addresses is reachable by sending an ICMP echo request to it. The receipt of an ICMP echo response is considered an indication that the IP address is reachable, and thus that address is not used for performing the attack. The receipt of an ICMP unreachable error message is considered an indication that the IP address is unreachable, and thus that address is used for performing the attack.

[Gont, 2008b] describes how the so-called ICMP soft errors could be used by TCP to abort connections in any of the non-synchronized states.
Implementing the mechanism described in that document would certainly not eliminate the vulnerability of TCP to SYN flood attacks, as the attacker could use addresses that are simply "black-holed". However, it shows how signaling such as that provided by ICMP error messages can carry valuable information that a transport protocol could use to perform heuristics.

In order to mitigate the impact of this attack, the amount of information stored for non-established connections should be reduced (ideally, non-synchronized connections should not require any state information to be maintained at the TCP performing the passive OPEN). There are basically two mitigation techniques for this vulnerability: a syn-cache and syn-cookies.

[Borman, 1997] and RFC 4987 [Eddy, 2007] contain a general discussion of SYN-flooding attacks and common mitigation approaches.

The syn-cache [Lemon, 2002] approach aims at reducing the amount of state information that is maintained for connections in the SYN-RECEIVED state, and allocates a full TCB only after the connection has transitioned to the ESTABLISHED state.

The syn-cookie [Bernstein, 1996] approach aims at completely eliminating the need to maintain state information at the TCP performing the passive OPEN. It does this by encoding the most elementary information required to complete the three-way handshake in the Sequence Number of the SYN/ACK segment that is sent in response to the received SYN segment. Thus, TCP is relieved from keeping state for connections in the SYN-RECEIVED state.

The syn-cookie approach has a number of drawbacks:

   o  Firstly, given the limited space in the Sequence Number field, it is not possible to encode all the information included in the initial segment, such as, for example, support of Selective Acknowledgements (SACK).
   o  Secondly, consider the case in which the TCP that performed the passive OPEN (i.e., the TCP server) sends its SYN/ACK, and the Acknowledgement segment sent in response is lost. The connection would then end up in the ESTABLISHED state on the client side, but in the CLOSED state on the server side. This scenario is normally handled in TCP by having the TCP server retransmit its SYN/ACK. However, if syn-cookies are enabled, there would be no connection state information on the server side, and thus the SYN/ACK would never be retransmitted, leaving the connection in this inconsistent state. Suppose the application protocol required the client to wait for some data from the server (e.g., a greeting message) before sending any data to the server. A deadlock would then take place, with the client application waiting for such server data, and the server waiting for the TCP three-way handshake to complete.

   o  Thirdly, unless the function used to encode information in the SYN/ACK packet is cryptographically strong, an attacker could forge TCP connections in the ESTABLISHED state by forging ACK segments that would be considered "legitimate" by the receiving TCP.

   o  Fourthly, consider scenarios in which a firewall blocks the establishment of new connections by simply dropping segments with the SYN bit set. There, use of SYN cookies could allow an attacker to bypass the firewall rules, as a connection could be established by forging an ACK segment with the correct values, without the need of setting the SYN bit.

As a result, syn-cookies are usually not employed as a first line of defense against SYN-flood attacks, but only as a last resort to cope with them.
For example, some TCP implementations enable syn-cookies only after a certain number of TCBs has been allocated for connections in the SYN-RECEIVED state. We recommend this implementation technique, with a syn-cache enabled by default. Syn-cookies would then be triggered, for example, when the limit of TCBs for non-synchronized connections with a given port number has been reached.

It is interesting to note that a SYN-flood attack should only affect the establishment of new connections. A number of books and online documents seem to assume otherwise. They assume that TCP will not be able to respond to any TCP segment that is meant for a TCP port that is being SYN-flooded (e.g., respond with an RST segment upon receipt of a TCP segment that refers to a non-existent TCP connection). SYN-flooding attacks have been successfully exploited in the past for achieving such a goal [Shimomura, 1995]. However, as clarified by RFC 1948 [Bellovin, 1996], the effectiveness of SYN flood attacks in silencing a TCP implementation arose from a bug in the 4.4BSD TCP implementation [Wright and Stevens, 1994]. It did not arise from a theoretical property of SYN-flood attacks themselves. Therefore, those TCP implementations that do not suffer from such a bug should not be silenced as a result of a SYN-flood attack.

[Zuquete, 2002] describes a mechanism that could theoretically improve the functionality of SYN cookies. It exploits the TCP "simultaneous open" mechanism, as illustrated in Figure 5.

      See Figure 5, on page 46 of the UK CPNI document.

      Figure 5: Use of TCP simultaneous open for handling SYN floods

In line 1, TCP A initiates the connection-establishment phase by sending a SYN segment to TCP B.
In line 2, TCP B creates a SYN cookie as described by [Bernstein, 1996], but does not set the ACK bit of the segment it sends (thus really sending a SYN segment, rather than a SYN/ACK). This "fools" TCP A into thinking that both SYN segments "have crossed each other in the network" as if a "simultaneous open" scenario had taken place. As a result, in line 3 TCP A sends a SYN/ACK segment containing the same options that were contained in the original SYN segment. In line 4, upon receipt of this segment, TCP B processes the cookie encoded in the ACK field as if it had been the result of a traditional SYN cookie scenario, and moves the connection into the ESTABLISHED state. In line 5, TCP B sends a SYN/ACK segment, which causes the connection at TCP A to move into the ESTABLISHED state. In line 6, TCP A sends a data segment on the connection.

While this mechanism would work in theory, unfortunately there are a number of factors that prevent it from being usable in real network environments:

   o  Some systems are not able to perform the "simultaneous open" operation specified in RFC 793, and thus the connection establishment will fail.

   o  Some firewalls might prevent the establishment of TCP connections that rely on the "simultaneous open" mechanism (e.g., a given firewall might allow incoming SYN/ACK segments, but not outgoing SYN/ACK segments).

Therefore, we do not recommend implementation of this mechanism for mitigating SYN-flood attacks.

5.2. Connection forgery

The process of causing a TCP connection to be illegitimately established between two arbitrary remote peers is usually referred to as "connection spoofing" or "connection forgery".
This can have a severe impact when systems establish some sort of trust relationship based on the IP addresses used to establish a TCP connection [daemon9 et al, 1996].

It should be stressed that hosts should not establish trust relationships based on the IP addresses [CPNI, 2008] or on the TCP ports in use for the TCP connection (see Section 3.1 and Section 3.2 of this document).

One of the underlying weaknesses that allow this vulnerability to be more easily exploited is the use of an inadequate Initial Sequence Number (ISN) generator, as explained as early as 1985 in [Morris, 1985]. Any TCP implementation that makes use of an inadequate ISN generator will be more vulnerable to this type of attack. Section 3.3.1 of this document discusses this issue, along with approaches for a more careful generation of ISNs.

Another attack vector for performing connection-forgery attacks is the use of IP source routing. By forging the Source Address of the IP packets that encapsulate the TCP segments of a connection, and carefully crafting an IP source route option (i.e., either LSRR or SSRR) that includes a system whose traffic he can monitor, an attacker could cause the packets sent by the attacked system (e.g., the SYN/ACK segment sent in response to the attacker's SYN segment) to be illegitimately directed to him [CPNI, 2008]. Thus, the attacker would not even need to guess valid sequence numbers for forging a TCP connection, as he would simply have direct access to all this information. As discussed in [CPNI, 2008], it is strongly recommended that systems disable IP source routing by default or, at the very least, disable source routing for IP packets that encapsulate TCP segments.
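Disabling source routing implies that an implementation can recognize source-routed packets. As a minimal sketch (an assumption about how such a check might look, not a description of any particular stack), the following C function walks the options of an IPv4 header and reports whether an LSRR (option type 131) or SSRR (option type 137) option, as defined in RFC 791, is present. The function name, macro names and return convention are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* IPv4 option type values (RFC 791); names are local to this sketch. */
#define IPV4_OPT_EOL  0x00   /* End of Option List */
#define IPV4_OPT_NOP  0x01   /* No Operation (single byte) */
#define IPV4_OPT_LSRR 0x83   /* Loose Source and Record Route (131) */
#define IPV4_OPT_SSRR 0x89   /* Strict Source and Record Route (137) */

/*
 * Return 1 if the IPv4 header in hdr[0..len) carries an LSRR or SSRR
 * option, 0 if it does not, and -1 if the header is malformed.
 */
int ipv4_has_source_route(const uint8_t *hdr, size_t len)
{
    if (len < 20)
        return -1;
    size_t ihl = (size_t)(hdr[0] & 0x0f) * 4;   /* header length in bytes */
    if (ihl < 20 || ihl > len)
        return -1;

    size_t i = 20;                               /* options start here */
    while (i < ihl) {
        uint8_t type = hdr[i];
        if (type == IPV4_OPT_EOL)
            break;
        if (type == IPV4_OPT_NOP) {
            i++;
            continue;
        }
        if (i + 1 >= ihl)                        /* no room for length byte */
            return -1;
        uint8_t optlen = hdr[i + 1];
        if (optlen < 2 || i + optlen > ihl)      /* bogus option length */
            return -1;
        if (type == IPV4_OPT_LSRR || type == IPV4_OPT_SSRR)
            return 1;
        i += optlen;
    }
    return 0;
}
```

A packet for which this check returns a non-zero value could then be dropped before it is handed to TCP; treating malformed headers (-1) as droppable is a design choice of the sketch.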
The IPv6 Routing Header Type 0, which provides functionality similar to that of IPv4 source routing, has been officially deprecated by RFC 5095 [Abley et al, 2007].

5.3. Connection-flooding attack

5.3.1. Vulnerability

The creation and maintenance of a TCP connection requires system memory to maintain shared state between the local and the remote TCP. As system memory is a finite resource, there is a limit on the number of TCP connections that a system can maintain at any time. When the TCP API is employed to create a TCP connection with a remote peer, it allocates system memory for maintaining shared state with the remote TCP peer, and thus the resulting connection ties a similar amount of resources at the remote host as at the local host. However, if special packet-crafting tools are employed to forge TCP segments to establish TCP connections with a remote peer, the local kernel implementation of TCP can be bypassed, and the allocation of resources on the attacker's system for maintaining shared state can be avoided. Thus, a malicious user could create a large number of TCP connections and subsequently abandon them, tying system resources only at the remote peer. This allows an attacker to create a large number of TCP connections at the attacked system with the intent of exhausting its kernel memory, without exhausting the attacker's own resources. [CERT, 2000] discusses this vulnerability, which is usually referred to as the "Naptha attack".

This attack is similar in nature to the "Netkill" attack discussed in Section 7.1.1. However, while Netkill ties both TCBs and TCP send buffers to the abandoned connections, Naptha only ties TCBs (and related kernel structures), as it does not issue any application requests.
The symptom of this attack is an extremely large number of TCP connections in the ESTABLISHED state, which would tend to exhaust system resources and deny service to new clients (or possibly cause the system to crash).

It should be noted that it is possible for an attacker to perform the same type of attack causing the abandoned connections to remain in states other than ESTABLISHED. This might be interesting for an attacker, as connections in states other than ESTABLISHED usually have no controlling user-space process (that is, the former controlling process for the connection has already closed the corresponding file descriptor).

A particularly interesting case of a connection-flooding attack that aims at abandoning connections in a state other than ESTABLISHED is discussed in Section 6.1 of this document.

5.3.2. Countermeasures

As with many other resource exhaustion attacks, the problem in generating countermeasures for this attack is that it may be difficult to differentiate between an actual attack and a legitimate high-load scenario. However, there are a number of countermeasures which, when tuned for each particular network environment, could allow a system to resist this attack and continue servicing legitimate clients.

Enforcing limits on the number of connections with no user-space controlling process

Connections in states other than ESTABLISHED usually have no user-space controlling process. This prevents the application making use of those connections from enforcing limits on the maximum number of ongoing connections (either on a global basis or on a per-IP-address basis).
When resource exhaustion is imminent or some threshold of ongoing connections is reached, the operating system should consider freeing system resources by aborting connections that have no user-space controlling process. A number of such connections could be aborted on a random basis, or based on some heuristics performed by the operating system (e.g., first abort connections with peers that have the largest number of ongoing connections with no user-space controlling process).

Enforcing per-user and per-process limits

While the Naptha attack is usually targeted at a service such as HTTP, its impact is usually system-wide. This is particularly undesirable, as an attack against a single service might affect the system as a whole (for example, possibly precluding remote system administration).

In order to prevent an attack against a single service from affecting other services, we advise TCP implementations to enforce per-process and per-user limits on the maximum kernel memory that can be used at any time. Additionally, we recommend that implementations enforce per-process and per-user limits on the number of existing TCP connections at any time.

Limiting the number of simultaneous connections at the application

An application could limit the number of simultaneous connections that can be established from a single IP address or network prefix at any given time. Once that limit has been reached, some other connection from the same IP address or network prefix would be aborted, thus allowing the application to service the new incoming connection.

There are a number of factors that should be taken into account when defining the specific limit to enforce. For example, in the case of protocols that have an authentication phase (e.g., SSH, POP3), this limit could be applied to sessions that have not yet been authenticated.
Additionally, depending on the nature and use of the application, it might or might not be normal for a single system to have multiple connections to the same server at the same time.

For many network services, the limit of maximum simultaneous connections could be kept very low. For example, an SMTP server could limit the number of simultaneous connections from a single IP address to 10 or 20 connections.

While this limit could work in many network scenarios, we recommend that network operators measure the maximum number of concurrent connections from a single IP address during normal operation, and set the limit accordingly.

In the case of web servers, this limit will usually need to be set much higher, as it is common practice for web clients to establish multiple simultaneous connections with a single web server to speed up the process of loading a web page (e.g., multiple graphic files can be downloaded simultaneously using separate TCP connections).

NATs (Network Address Translators) [Srisuresh and Egevang, 2001] are widely deployed in the Internet, and may exacerbate this situation, as a large number of clients behind a NAT might each establish multiple TCP connections with a given web server, which would all appear to originate from the same IP address (that of the NAT box).

Limiting the number of simultaneous connections at firewalls

Some firewalls can be configured to limit the number of simultaneous connections that any system can maintain with a specific system and/or service at any given time. Limiting the number of simultaneous connections that each system can establish with a specific system and service would effectively limit the ability of an attacker that controls a single IP address to exhaust system resources at the attacked system/service.

5.4. Firewall-bypassing techniques

Some firewalls block incoming TCP connections by blocking only incoming SYN segments. However, there are inconsistencies in how different TCP implementations handle SYN segments that have additional flags set, which may allow an attacker to bypass firewall rules [US-CERT, 2003b].

For example, some firewalls have been known to mistakenly allow incoming SYN segments if they also have the RST bit set. As some TCP implementations will create a new connection in response to a TCP segment with both the SYN and RST bits set, an attacker could bypass the firewall rules and establish a connection with a "protected" system by setting the RST bit in his SYN segments.

We advise TCP implementations to silently drop TCP segments that have both the SYN and the RST flags set.

6. Connection-termination mechanism

6.1. FIN-WAIT-2 flooding attack

6.1.1. Vulnerability

TCP implements a connection-termination mechanism that is employed for the graceful termination of a TCP connection. This mechanism usually consists of the exchange of four segments. Figure 6 illustrates the usual segment exchange for this mechanism.

See Figure 6, on page 50 of the UK CPNI document.

Figure 6: TCP connection-termination mechanism

A potential problem may arise as a result of the FIN-WAIT-2 state: there is no limit on the amount of time that a TCP can remain in the FIN-WAIT-2 state. Furthermore, no segment exchange is required to maintain the connection in that state.

As a result, an attacker could establish a large number of connections with the target system, and cause it to close each of them. For each connection, once the target system has sent its FIN segment, the attacker would acknowledge the receipt of this segment, but would send no further segments on that connection.
In this way, an attacker could cause the corresponding system resources (e.g., the system memory used for storing the TCB) to remain allocated indefinitely, without the need to send any further packets.

While the CLOSE command described in RFC 793 [Postel, 1981c] simply signals the remote TCP end-point that this TCP has finished sending data (i.e., it closes only one direction of the data transfer), the close() system call available in most operating systems has different semantics: it marks the corresponding file descriptor as closed (and thus it is no longer usable), and assigns the operating system the responsibility to deliver any queued data to the remote TCP peer and to terminate the TCP connection. This makes the FIN-WAIT-2 state particularly attractive for performing memory exhaustion attacks, as even if the application running on top of TCP were imposing limits on the maximum number of ongoing connections, and/or time limits on the function calls performed on TCP connections, that application would be unable to enforce these limits on the FIN-WAIT-2 state.

6.1.2. Countermeasures

A number of countermeasures can be implemented to mitigate FIN-WAIT-2 flooding attacks. Some of these countermeasures require changes in the TCP implementations, while others require changes in the applications running on top of TCP.

Enforcing limits on the number of connections with no user-space controlling process

The considerations and recommendations in Section 5.3.2 for enforcing limits on the number of connections with no user-space controlling process are also applicable for mitigating this vulnerability.

Enforcing limits on the duration of the FIN-WAIT-2 state

In order to avoid the risk of having connections stuck in the FIN-WAIT-2 state indefinitely, a number of systems incorporate a timeout for the FIN-WAIT-2 state.
For example, the Linux kernel version 2.4 enforces a timeout of 60 seconds [Linux, 2008]. If the connection-termination mechanism does not complete before that timeout value, it is aborted.

We advise the implementation of such a timeout for the FIN-WAIT-2 state.

Enabling applications to enforce limits on ongoing connections

As discussed in Section 6.1.1, the fact that the close() system call marks the corresponding file descriptor as closed prevents the application running on top of TCP from enforcing limits on the corresponding connection.

While it is common practice for applications to terminate their connections by means of the close() system call, it is possible for an application to initiate the connection-termination phase without closing the corresponding file descriptor (hence keeping control of the connection).

In order to achieve this, an application performing an active close (i.e., initiating the connection-termination phase) should replace the call to close(sockfd) with the following code sequence:

o A call to shutdown(sockfd, SHUT_WR), to close the sending direction of this connection.

o Successive calls to read(), until it returns 0, thus indicating that the remote TCP peer has finished sending data.

o A call to setsockopt(sockfd, SOL_SOCKET, SO_LINGER, &l, sizeof(l)), where l is of type struct linger (with its members l.l_onoff=1 and l.l_linger=90).

o A call to close(sockfd), to close the corresponding file descriptor.

The call to shutdown() (instead of close()) allows the application to retain control of the underlying TCP connection while the connection transitions through the FIN-WAIT-1 and FIN-WAIT-2 states. However, the application will not retain control of the connection while it transitions through the CLOSING and TIME-WAIT states.
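The active-close sequence listed above can be sketched in C as follows. This is a minimal illustration assuming a connected SOCK_STREAM descriptor and the POSIX sockets API; the function name active_close() and the choice to leave the descriptor open on error are assumptions of this sketch. The 90-second linger time is the value given in the text.

```c
#include <errno.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Active close that keeps control of the connection until the peer has
 * finished sending. Returns the number of bytes read (and discarded)
 * while waiting for the peer's FIN, or -1 on error, in which case the
 * descriptor is left open for the caller to deal with.
 */
long active_close(int sockfd)
{
    char buf[4096];
    long drained = 0;

    /* 1. Send our FIN but keep the descriptor (FIN-WAIT-1/FIN-WAIT-2). */
    if (shutdown(sockfd, SHUT_WR) == -1)
        return -1;

    /* 2. Read until read() returns 0: the peer has finished sending. */
    for (;;) {
        ssize_t n = read(sockfd, buf, sizeof buf);
        if (n > 0) {
            drained += (long)n;
        } else if (n == 0) {
            break;
        } else if (errno != EINTR) {
            return -1;
        }
    }

    /* 3. Linger: close() may block up to l_linger seconds while any
     *    queued data is still being delivered. */
    struct linger l = { .l_onoff = 1, .l_linger = 90 };
    if (setsockopt(sockfd, SOL_SOCKET, SO_LINGER, &l, sizeof l) == -1)
        return -1;

    /* 4. Release the descriptor. */
    if (close(sockfd) == -1)
        return -1;
    return drained;
}
```

Because the descriptor stays open through step 2, the application can keep counting this connection against its own limits until the peer's FIN has arrived.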
It should be noted that, strictly speaking, close(sockfd) decrements the reference count for the descriptor sockfd, and initiates the connection-termination phase only when the reference count reaches 0. On the other hand, shutdown(sockfd, SHUT_WR) initiates the connection-termination phase regardless of the reference count for the sockfd descriptor. This should be taken into account when performing the code replacement described above. For example, it would be a bug for two processes (e.g., parent and child) that share a descriptor to both call shutdown(sockfd, SHUT_WR).

An application performing a passive close should replace the call to close(sockfd) with the following code sequence:

o A call to setsockopt(sockfd, SOL_SOCKET, SO_LINGER, &l, sizeof(l)), where l is of type struct linger (with its members l.l_onoff=1 and l.l_linger=90).

o A call to close(sockfd), to close the corresponding file descriptor.

It is assumed that if the application is performing a passive close, it has already detected that the remote TCP peer finished sending data, as a result of a call to read() returning 0.

In this scenario, the application will not retain control of the underlying connection when it transitions through the LAST-ACK state.

Limiting the number of simultaneous connections at the application

The considerations and recommendations in Section 5.3.2 for limiting the number of simultaneous connections at the application are also applicable for mitigating this vulnerability. We note, however, that unless applications are implemented to retain control of the underlying TCP connection while the connection transitions through the FIN-WAIT-1 and FIN-WAIT-2 states, enforcing such limits may prove to be a difficult task.
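For completeness, the passive-close sequence described above might look as follows, assuming the caller has already observed read() returning 0 on the descriptor. The function name passive_close() is hypothetical; the linger values are those given in the text.

```c
#include <sys/socket.h>
#include <unistd.h>

/*
 * Passive close, for an application that has already seen read()
 * return 0 (the peer's FIN has arrived). Returns 0 on success, -1 on
 * error.
 */
int passive_close(int sockfd)
{
    struct linger l = { .l_onoff = 1, .l_linger = 90 };

    /* Bound how long close() may block delivering our queued data. */
    if (setsockopt(sockfd, SOL_SOCKET, SO_LINGER, &l, sizeof l) == -1)
        return -1;

    /* Sends our FIN; for TCP the connection then enters LAST-ACK. */
    return close(sockfd);
}
```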
Limiting the number of simultaneous connections at firewalls

The considerations and recommendations in Section 5.3.2 for limiting the number of simultaneous connections at firewalls are also applicable for mitigating this vulnerability.

7. Buffer management

7.1. TCP retransmission buffer

7.1.1. Vulnerability

[Shalunov, 2000] describes a resource exhaustion attack (Netkill) that can be performed against TCP. The attack aims at exhausting system memory by creating a large number of TCP connections which are then abandoned. The attack is usually performed as follows:

o The attacker creates a TCP connection to a service in which a small client request can result in a large server response (e.g., HTTP). Rather than relying on his kernel implementation of TCP, the attacker creates his TCP connections by means of a specialized packet-crafting tool. This allows the attacker to create the TCP connections and later abandon them, exhausting the resources at the attacked system, while not tying his own system resources to the abandoned connections.

o When the connection is established (i.e., the three-way handshake has completed), an application request is sent, and the TCP connection is subsequently abandoned. At this point, any state information kept by the attack tool is removed.

o The attacked server allocates TCP send buffers for transmitting the response to the client's request. This causes the victim TCP to tie resources not only for the Transmission Control Block (TCB), but also for the application data that needs to be transferred.

o Once the application response is queued for transmission, the application closes the TCP connection, and thus TCP takes the responsibility to deliver the queued data.
Having the application close the connection has the benefit for the attacker that the application is not able to keep track of the number of TCP connections in use, and thus it is not able to enforce limits on the number of connections.

o The attacker repeats the above steps a large number of times, thus causing a large amount of system memory at the victim host to be tied to the abandoned connections. When the system memory is exhausted, the victim host denies service to new connections, or possibly crashes.

There are a number of factors that affect the effectiveness of this attack that are worth considering. Firstly, while the attack is typically targeted at a service such as HTTP, the consequences of the attack are usually system-wide. Secondly, depending on the size of the server's response, the underlying TCP connection may or may not be closed: if the response is larger than the TCP send buffer size at the server, the application will usually block in a call to write() or send(), and would therefore not close the TCP connection, thus allowing the application to enforce limits on the number of ongoing connections. Consequently, the attacker will usually try to elicit a response that is equal to or slightly smaller than the send buffer of the attacked TCP. Thirdly, while [Shalunov, 2000] notes that one visible effect of this attack is a large number of connections in the FIN-WAIT-1 state, this will not usually be the case. Given that the attacker never acknowledges any segment other than the SYN/ACK segment that is part of the three-way handshake, at the point at which the attacked TCP tries to send the application's response the congestion window (cwnd) will usually be 4*SMSS (four maximum-sized segments).
If the application's response were larger than 4*SMSS, even if the application had closed the connection, the FIN segment would never be sent, and thus the connection would still remain in the ESTABLISHED state (rather than transition to the FIN-WAIT-1 state).

7.1.2. Countermeasures

The resource exhaustion attack described in Section 7.1.1 does not necessarily differ from a legitimate high-load scenario, and therefore is hard to mitigate without negatively affecting the robustness of TCP. However, complementary mitigations can still be implemented to limit the impact of these attacks.

Enforcing limits on the number of connections with no user-space controlling process

The considerations and recommendations in Section 5.3.2 for enforcing limits on the number of connections with no user-space controlling process are also applicable for mitigating this vulnerability.

Enforcing per-user and per-process limits

While the Netkill attack is usually targeted at a service such as HTTP, its impact is usually system-wide. This is particularly undesirable, as an attack against a single service might affect the system as a whole (for example, possibly precluding remote system administration).

In order to prevent an attack against a single service from affecting other services, we advise TCP implementations to enforce per-process and per-user limits on the maximum kernel memory that can be used at any time. Additionally, we recommend that implementations enforce per-process and per-user limits on the number of existing TCP connections at any time.

Limiting the number of ongoing connections at the application

The considerations and recommendations in Section 5.3.2 for limiting the number of simultaneous connections at the application are also applicable for mitigating this vulnerability.
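As an illustration of the per-address limits discussed in Section 5.3.2, the sketch below keeps a small table of ongoing connections per IPv4 address. The table size, the structure and function names, and the policy of refusing the new connection (rather than aborting an older connection from the same address, as Section 5.3.2 suggests) are assumptions made to keep the example short. A per-prefix limit could be obtained by masking the address before calling these functions.

```c
#include <stddef.h>
#include <stdint.h>

#define LIMITER_SLOTS 64         /* hypothetical: max tracked addresses */

struct ip_count {
    uint32_t addr;               /* IPv4 address, host byte order */
    unsigned count;              /* ongoing connections from addr */
};

struct conn_limiter {
    struct ip_count slots[LIMITER_SLOTS];
    size_t used;
    unsigned limit;              /* e.g., 10 or 20 for an SMTP server */
};

static struct ip_count *limiter_find(struct conn_limiter *cl, uint32_t addr)
{
    for (size_t i = 0; i < cl->used; i++)
        if (cl->slots[i].addr == addr)
            return &cl->slots[i];
    return NULL;
}

/* Call after accept(); returns 1 if the connection may proceed, 0 if not. */
int limiter_admit(struct conn_limiter *cl, uint32_t addr)
{
    struct ip_count *e = limiter_find(cl, addr);
    if (e == NULL) {
        if (cl->used == LIMITER_SLOTS)
            return 0;            /* table full: refuse rather than track */
        e = &cl->slots[cl->used++];
        e->addr = addr;
        e->count = 0;
    }
    if (e->count >= cl->limit)
        return 0;
    e->count++;
    return 1;
}

/* Call once the application has finished with a connection from addr. */
void limiter_release(struct conn_limiter *cl, uint32_t addr)
{
    struct ip_count *e = limiter_find(cl, addr);
    if (e == NULL || e->count == 0)
        return;
    if (--e->count == 0)         /* free the slot: move the last one here */
        *e = cl->slots[--cl->used];
}
```

Note that limiter_release() is only called reliably if the application keeps control of the connection until termination completes, which is the point made above about FIN-WAIT-1 and FIN-WAIT-2.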
Enabling applications to enforce limits on ongoing connections

As discussed in Section 6.1.1, the fact that the close() system call marks the corresponding file descriptor as closed prevents the application running on top of TCP from enforcing limits on the corresponding connection.

While it is common practice for applications to terminate their connections by means of the close() system call, it is possible for an application to initiate the connection-termination phase without closing the corresponding file descriptor (hence keeping control of the connection).

In order to achieve this, an application performing an active close (i.e., initiating the connection-termination phase) should replace the call to close(sockfd) with the following code sequence:

o A call to shutdown(sockfd, SHUT_WR), to close the sending direction of this connection.

o Successive calls to read(), until it returns 0, thus indicating that the remote TCP peer has finished sending data.

o A call to setsockopt(sockfd, SOL_SOCKET, SO_LINGER, &l, sizeof(l)), where l is of type struct linger (with its members l.l_onoff=1 and l.l_linger=90).

o A call to close(sockfd), to close the corresponding file descriptor.

The call to shutdown() (instead of close()) allows the application to retain control of the underlying TCP connection while the connection transitions through the FIN-WAIT-1 and FIN-WAIT-2 states. However, the application will not retain control of the connection while it transitions through the CLOSING and TIME-WAIT states. Nevertheless, in these states TCP should not have any pending data to send to the remote TCP peer or to be received by the application running on top of it, and thus these states are less of a concern for this particular vulnerability (Netkill).
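Retaining the descriptor is what lets the application bound how long a connection may stay in FIN-WAIT-1/FIN-WAIT-2. The following sketch goes one step beyond the sequence described above and is an assumption, not part of it: it waits for the peer's FIN with poll() and, if the peer stays silent for timeout_ms milliseconds, sets a zero linger time before close(), which on many implementations aborts the connection with an RST instead of leaving it to the kernel. The function name and timeout_ms are hypothetical; since the timer restarts after every read, it acts as an idle timeout.

```c
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Active close with an application-enforced idle limit while waiting
 * for the peer's FIN. Returns 0 if the peer finished sending in time,
 * 1 if the connection was aborted, -1 on error (descriptor left open).
 */
int active_close_bounded(int sockfd, int timeout_ms)
{
    char buf[4096];
    struct pollfd pfd = { .fd = sockfd, .events = POLLIN };

    if (shutdown(sockfd, SHUT_WR) == -1)     /* send our FIN */
        return -1;

    for (;;) {
        int r = poll(&pfd, 1, timeout_ms);
        if (r == -1 && errno == EINTR)
            continue;
        if (r == -1)
            return -1;
        if (r == 0) {
            /* Peer silent for too long: a zero linger time makes close()
               discard the connection instead of leaving it lingering. */
            struct linger abort_l = { .l_onoff = 1, .l_linger = 0 };
            (void)setsockopt(sockfd, SOL_SOCKET, SO_LINGER,
                             &abort_l, sizeof abort_l);
            close(sockfd);
            return 1;
        }
        ssize_t n = read(sockfd, buf, sizeof buf);
        if (n == 0)
            break;                           /* peer's FIN received */
        if (n == -1 && errno != EINTR)
            return -1;
        /* n > 0: discard remaining peer data and keep waiting */
    }

    struct linger l = { .l_onoff = 1, .l_linger = 90 };
    if (setsockopt(sockfd, SOL_SOCKET, SO_LINGER, &l, sizeof l) == -1)
        return -1;
    return close(sockfd) == -1 ? -1 : 0;
}
```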
It should be noted that, strictly speaking, close(sockfd) decrements the reference count for the descriptor sockfd, and initiates the connection-termination phase only when the reference count reaches 0. On the other hand, shutdown(sockfd, SHUT_WR) initiates the connection-termination phase regardless of the reference count for the sockfd descriptor. This should be taken into account when performing the code replacement described above. For example, it would be a bug for two processes (e.g., parent and child) that share a descriptor to both call shutdown(sockfd, SHUT_WR).

An application performing a passive close should replace the call to close(sockfd) with the following code sequence:

o A call to setsockopt(sockfd, SOL_SOCKET, SO_LINGER, &l, sizeof(l)), where l is of type struct linger (with its members l.l_onoff=1 and l.l_linger=90).

o A call to close(sockfd), to close the corresponding file descriptor.

It is assumed that if the application is performing a passive close, it has already detected that the remote TCP peer finished sending data, as a result of a call to read() returning 0.

In this scenario, the application will not retain control of the underlying connection when it transitions through the LAST-ACK state. However, in this state TCP should not have any pending data to send to the remote TCP peer or to be received by the application running on top of TCP, and thus this state is less of a concern for this particular vulnerability (Netkill).

Limiting the number of simultaneous connections at firewalls

The considerations and recommendations in Section 5.3.2 for limiting the number of simultaneous connections at firewalls are also applicable for mitigating this vulnerability.
Performing heuristics on ongoing TCP connections

Heuristics performed on ongoing TCP connections may help when scarce system resources, such as memory, are about to become exhausted. A number of parameters may be useful for such heuristics.

In the case of the Netkill attack described in [Shalunov, 2000], there are two parameters that are characteristic of a TCP being attacked:

o A large amount of data queued in the TCP retransmission buffer (e.g., the socket send buffer).

o Only a small amount of data has been successfully transferred to the remote peer.

Clearly, these two parameters do not necessarily indicate an ongoing attack. However, if exhaustion of the corresponding system resources were imminent, these two parameters (among others) could be used to perform heuristics when considering aborting ongoing connections.

It should be noted that while an attacker could advertise a zero window to cause the target system to tie system memory to the TCP retransmission buffer, it is hard to derive any useful heuristics from the advertised window. While it is tempting to enforce a limit on the length of the persist state (see Section 3.7.2 of this document), an attacker could simply open the window (i.e., advertise a TCP window larger than zero) from time to time to prevent this enforced limit from causing his malicious connections to be aborted.

7.2. TCP segment reassembly buffer

TCP buffers out-of-order segments to more efficiently handle the occurrence of packet reordering and segment loss. When out-of-order data are received, a "hole" momentarily exists in the data stream which must be filled before the received data can be delivered to the application making use of TCP's services.
This situation can be exploited by an attacker, who could intentionally create a hole in the data stream by sending a number of segments with a sequence number larger than the next sequence number expected (RCV.NXT) by the attacked TCP. Thus, the attacked TCP would tie system memory to buffer the out-of-order segments, without being able to hand the received data to the corresponding application.

If a large number of such connections were created, system memory could be exhausted, precluding the attacked TCP from servicing new connections and/or continuing to service previously established TCP connections.

Fortunately, these attacks can be easily mitigated, at the expense of degrading the performance of possibly legitimate connections. When out-of-order data are received, an Acknowledgement segment is sent with the next sequence number expected (RCV.NXT). This means that receipt of the out-of-order data will not actually be acknowledged by TCP's cumulative Acknowledgement Number. As a result, a TCP is free to discard any data that have been received out-of-order, without affecting the reliability of the data transfer. Given the performance implications of discarding out-of-order segments for legitimate connections, this pruning policy should be applied only if memory exhaustion is imminent.

As a result of discarding the out-of-order data, these data will need to be unnecessarily retransmitted. Additionally, a loss event will be detected by the sending TCP, and the slow start phase of TCP's congestion control will be entered, reducing the data transfer rate of the connection.
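The pruning policy described above can be sketched as follows. This is a simplified model: the structure and function names are hypothetical, payload copying and segment coalescing are omitted, and a single global byte counter stands in for the system's memory accounting. What it shows is the key point of the argument: discarding out-of-order data never touches RCV.NXT, so nothing covered by the cumulative acknowledgement is lost.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* One buffered out-of-order segment (payload length only). */
struct ooo_seg {
    uint32_t seq;               /* first sequence number of the segment */
    size_t len;                 /* payload bytes held in memory */
    struct ooo_seg *next;
};

struct conn {
    uint32_t rcv_nxt;           /* next in-order sequence number expected */
    struct ooo_seg *ooo;        /* out-of-order data, not cumulatively ACKed */
};

static size_t ooo_bytes_total;  /* hypothetical system-wide accounting */

/* Queue an out-of-order segment (payload copying omitted). */
int ooo_enqueue(struct conn *c, uint32_t seq, size_t len)
{
    struct ooo_seg *s = malloc(sizeof *s);
    if (s == NULL)
        return -1;
    s->seq = seq;
    s->len = len;
    s->next = c->ooo;
    c->ooo = s;
    ooo_bytes_total += len;
    return 0;
}

/*
 * Discard all out-of-order data of a connection. rcv_nxt is left
 * untouched, so the sender will simply retransmit the discarded data.
 * Returns the number of bytes freed.
 */
size_t ooo_prune(struct conn *c)
{
    size_t freed = 0;
    while (c->ooo != NULL) {
        struct ooo_seg *s = c->ooo;
        c->ooo = s->next;
        freed += s->len;
        free(s);
    }
    ooo_bytes_total -= freed;
    return freed;
}

/* Apply the pruning policy only when buffered data exceed a threshold. */
size_t maybe_prune(struct conn *c, size_t threshold)
{
    return ooo_bytes_total > threshold ? ooo_prune(c) : 0;
}
```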
It is interesting to note that this pruning policy could be applied even if Selective Acknowledgements (SACK) (specified in RFC 2018 [Mathis et al, 1996]) are in use, as SACK provides only advisory information, and does not preclude the receiving TCP from discarding data that have been previously selectively acknowledged by means of TCP's SACK option, but not acknowledged by TCP's cumulative Acknowledgement Number.

There are a number of ways in which the pruning policy could be triggered. For example, when out-of-order data are received, a timer could be set, and the sequence number of the out-of-order data could be recorded. If the hole were filled before the timer expired, the timer would be turned off. However, if the timer expired before the hole were filled, all the out-of-order segments of the corresponding connection would be discarded. This would be a proactive countermeasure for attacks that aim at exhausting the receive buffers.

In addition, an implementation could incorporate reactive mechanisms for more carefully controlling buffer allocation when some predefined buffer allocation threshold was reached. At that point, pruning policies would be applied.

A number of mechanisms can aid in the process of freeing system resources. For example, a table of network prefixes corresponding to the IP addresses of TCP peers that have ongoing TCP connections could record the aggregate amount of out-of-order data currently buffered for those connections. When the pruning policy was triggered, TCP connections with hosts that have network prefixes with large aggregate out-of-order buffered data could be selected first for pruning the out-of-order segments.
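A per-prefix accounting table of the kind described above might look like the following sketch. The /24 aggregation, the table size and the structure and function names are assumptions chosen for illustration; IPv4 addresses are handled in host byte order.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_PREFIXES 128         /* hypothetical table size */
#define PREFIX_LEN   24          /* hypothetical aggregation (1..31): /24 */

struct prefix_acct {
    uint32_t prefix;             /* IPv4 network prefix, host byte order */
    long ooo_bytes;              /* aggregate out-of-order data buffered */
};

struct acct_table {
    struct prefix_acct e[MAX_PREFIXES];
    size_t n;
};

static uint32_t to_prefix(uint32_t addr)
{
    return addr & (0xFFFFFFFFu << (32 - PREFIX_LEN));
}

/* Add (delta > 0) or remove (delta < 0) buffered bytes for a peer. */
int acct_add(struct acct_table *t, uint32_t peer, long delta)
{
    uint32_t p = to_prefix(peer);
    for (size_t i = 0; i < t->n; i++) {
        if (t->e[i].prefix == p) {
            t->e[i].ooo_bytes += delta;
            return 0;
        }
    }
    if (t->n == MAX_PREFIXES)
        return -1;               /* table full */
    t->e[t->n].prefix = p;
    t->e[t->n].ooo_bytes = delta;
    t->n++;
    return 0;
}

/*
 * Find the prefix whose connections should be pruned first. Returns its
 * aggregate byte count and stores the prefix in *prefix_out, or returns
 * -1 if no prefix has out-of-order data buffered.
 */
long acct_worst(const struct acct_table *t, uint32_t *prefix_out)
{
    long best = -1;
    for (size_t i = 0; i < t->n; i++) {
        if (t->e[i].ooo_bytes > 0 && t->e[i].ooo_bytes > best) {
            best = t->e[i].ooo_bytes;
            *prefix_out = t->e[i].prefix;
        }
    }
    return best;
}
```

Aggregating by prefix rather than by address means an attacker spreading segments over several addresses in the same network still accumulates a single, prominent entry.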
Alternatively, if TCP segments were de-multiplexed by means of a hash table (as is currently the case in many TCP implementations), a counter could be held at each entry of the hash table that would record the aggregate out-of-order data currently buffered for those connections belonging to that hash table entry. When the pruning policy is triggered, the out-of-order data corresponding to those connections linked by the hash table entry with the largest amount of aggregate out-of-order data could be pruned first. It is important that this hash not be computable by an attacker, as this would allow him to maliciously cause the performance of specific connections to be degraded. That is, given a four-tuple that identifies a connection, an attacker should not be able to compute the corresponding hash value used by the target system to de-multiplex incoming TCP segments to that connection.

Another variant of a resource exhaustion attack against TCP's segment reassembly mechanism would target the data structures used to link the different holes in a data stream. For example, an attacker could send a burst of 1-byte segments, leaving a one-byte hole between each of the data bytes sent. Depending on the data structures used for holding and linking together each of the data segments, such an attack might waste a large amount of system memory by exploiting the overhead needed to store and link together each of these one-byte segments.

For example, if a linked-list is used for holding and linking each of the data segments, each of the involved data structures could involve one byte of kernel memory for storing the received data byte (the TCP payload), plus 4 bytes (32 bits) for storing a pointer to the next node in the linked-list.
Additionally, while such a data structure would require only a few bytes of kernel memory, it could result in the allocation of a whole memory page, thus consuming much more memory than expected.

Therefore, implementations should enforce a limit on the number of holes that are allowed in the received data stream at any given time. When such a limit is reached, incoming TCP segments that would create new holes would be silently dropped. Measurements in [Dharmapurikar and Paxson, 2005] indicate that the vast majority of TCP connections have at most a single hole at any given time. A limit of 16 holes for each connection would accommodate even most of the very unusual cases in which there can be more than one hole in the data stream at a given time.

[US-CERT, 2004a] is a security advisory about a Denial of Service vulnerability resulting from a TCP implementation that did not enforce limits on the number of segments stored in the TCP reassembly buffer.

Section 8 of this document describes the security implications of the TCP segment reassembly algorithm.

7.3. Automatic buffer tuning mechanisms

7.3.1. Automatic send-buffer tuning mechanisms

A number of TCP implementations incorporate automatic tuning mechanisms for the TCP send buffer size. In most of them, the underlying idea is to set the send buffer to some multiple of the congestion window (cwnd). This type of mechanism usually improves TCP's performance, by preventing the socket send buffer from becoming a bottleneck, while avoiding the need to simply overestimate the TCP send buffer size (i.e., make it arbitrarily large). [Semke et al, 1998] discusses such an automatic buffer tuning mechanism.

Unfortunately, automatic tuning mechanisms can be exploited by attackers to amplify the impact of other resource exhaustion attacks.
For example, an attacker could establish a TCP connection with a victim host, and cause the congestion window to be increased (either legitimately or illegitimately). Once the congestion window (and hence the TCP send buffer) is increased, he could cause the corresponding system memory to be tied up by advertising a zero-byte TCP window (see Section 3.7) or simply not acknowledging any data, thus amplifying the effect of resource exhaustion attacks such as that discussed in Section 7.1.1.

When an automatic buffer tuning mechanism is implemented, a number of countermeasures should be incorporated to prevent the mechanism from being exploited to amplify other resource exhaustion attacks.

Firstly, appropriate policies should be applied to guarantee fair use of the available system memory by each of the established TCP connections. Secondly, appropriate policies should be applied to prevent existing TCP connections from consuming all system resources, thus denying service to new TCP connections.

Appendix A of [Semke et al, 1998] proposes an algorithm for the fair share of the available system memory among the established connections. However, there are a number of limits that should be enforced on the system memory assigned for the send buffer of each connection. Firstly, each connection should always be assigned some minimum send buffer space that would enable TCP to achieve acceptable performance. Secondly, some system memory should be reserved for future connections, according to the maximum number of concurrent TCP connections that are expected to be successfully handled at any given time.
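The two limits described above (a guaranteed minimum per connection and a reserve for future connections) could be combined with a cwnd-based autotuning target roughly as follows. This is a sketch under stated assumptions: the function names and the "multiple of cwnd" parameter are hypothetical, and all sizes are in bytes.

```python
def send_buffer_cap(send_buffer_pool: int, min_buffer_size: int,
                    max_connections: int) -> int:
    """Upper bound on any single connection's send buffer (bytes).

    min_buffer_size bytes are held back for each of max_connections
    connections, so autotuning cannot hand that reserve to one connection.
    """
    reserve = min_buffer_size * max_connections
    if reserve > send_buffer_pool:
        raise ValueError("pool cannot cover the per-connection minimum")
    return send_buffer_pool - reserve


def autotuned_send_buffer(cwnd: int, multiple: int, send_buffer_pool: int,
                          min_buffer_size: int, max_connections: int) -> int:
    """Hypothetical autotuning target (multiple * cwnd), clamped to
    [min_buffer_size, send_buffer_cap(...)]."""
    cap = send_buffer_cap(send_buffer_pool, min_buffer_size, max_connections)
    return max(min_buffer_size, min(multiple * cwnd, cap))
```

The lower bound guarantees each connection its minimum; the upper bound ensures that no single connection can be given the memory set aside (min_buffer_size * max_connections bytes) for other connections.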
As a result, the following limit should be enforced on the size of each send buffer:

      send_buffer_size <= send_buffer_pool - (min_buffer_size * max_connections)

where

   send_buffer_size:
      Maximum send buffer size to be used for this connection

   send_buffer_pool:
      Total amount of system memory meant for TCP send buffers

   min_buffer_size:
      Minimum send buffer size for each TCP connection

   max_connections:
      Maximum number of TCP connections this system is expected to handle at a time

max_connections may be an artificial limit enforced by the system administrator specifically on the number of TCP connections, or may be derived from some other system limit (e.g., the maximum number of file descriptors).

These limits prevent the automatic tuning algorithm from assigning all the available memory buffers to ongoing connections, which would otherwise prevent the establishment of new connections.

Even if these limits are enforced, an attacker could still create a large number of TCP connections, each of them tying up valuable system resources. Therefore, in scenarios in which most of the system memory reserved for TCP send buffers is allocated to ongoing connections, it may be necessary for TCP to enforce some policy to free resources to either service more TCP connections, or to be able to improve the performance of other existing connections, by allocating more resources to them.

When needing to free memory in use for send buffers, particular attention should be paid to TCPs that have a large amount of data in the socket send buffer, and that at the same time fall into any of these categories:

o  The remote TCP peer has been advertising a small (possibly zero) window for a considerable period of time.

o  There have been a large number of retransmissions of segments corresponding to the first few windows of data.
o  Connections that fall into one of the previous categories, for which only a reduced amount of data has been successfully transferred to the peer TCP since the connection was established.

Unfortunately, all these cases are valid scenarios for the TCP protocol, and thus aborting connections that fall into any of these categories has the potential of causing interoperability problems. However, in scenarios in which all system resources are allocated, it may make sense to free resources allocated to TCP connections that are tying up a considerable amount of system resources and that have not made progress in a considerable period of time.

7.3.2. Automatic receive-buffer tuning mechanism

A number of TCP implementations include automatic tuning mechanisms for the receive buffer size. These mechanisms aim at setting the socket buffer to a size that is large enough to prevent the TCP window from becoming a bottleneck that would limit TCP's throughput, without wasting system memory by over-sizing it.

[Heffner, 2002] describes a mechanism for the automatic tuning of the socket receive buffer. Basically, the mechanism aims at measuring the amount of data received during an RTT (Round-Trip Time), and setting the socket receive buffer to some multiple of that value.

Unfortunately, automatic tuning mechanisms for the socket receive buffer can be exploited to perform a resource exhaustion attack. An attacker willing to exploit the automatic buffer tuning mechanism would first establish a TCP connection with the victim host. Subsequently, he would start a bulk data transfer to the victim host. By carefully responding to the peer's TCP segments, the attacker could cause the peer TCP to measure a large data/RTT value, which would lead to the adoption of an unnecessarily large socket receive buffer.
For example, the attacker could optimistically send more data than allowed by the TCP window advertised by the remote TCP. Those extra data would cross in the network with the window updates sent by the remote TCP, and could lead the TCP receiver to measure a data/RTT twice as big as the real one. Alternatively, if the TCP timestamp option (specified in RFC 1323 [Jacobson et al, 1992]) is used for RTT measurement, the attacker could lead the TCP receiver to measure a small RTT (and hence a large data/RTT rate) by "optimistically" echoing timestamps that have not yet been received.

Finally, once the TCP receiver is led to increase the size of its receive buffer, the attacker would transmit a large amount of data, filling the peer's whole receive buffer except for a few bytes at the beginning of the window (RCV.NXT). This gap would prevent the peer application from reading the data queued by TCP, thus tying up system memory with the received data segments until (if ever) the peer application times out.

A number of limits should be enforced on the amount of system memory assigned to any given connection. Firstly, each connection should always be assigned some minimum receive buffer space that would enable TCP to achieve minimally acceptable performance. Additionally, some system memory should be reserved for future connections, according to the maximum number of concurrent TCP connections that are expected to be successfully handled at any given time.
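A simplified data-per-RTT receive-buffer autotuner, in the spirit of the mechanism described in [Heffner, 2002], is sketched below together with an upper bound derived from the limits just described. The class name, the grow-only policy, and the multiplier of 2 are assumptions made for illustration; they are not taken from [Heffner, 2002].

```python
class RecvBufferAutotuner:
    """Simplified data-per-RTT receive-buffer autotuning (illustrative)."""

    def __init__(self, min_buffer_size: int, cap: int, multiple: int = 2):
        self.min_buffer_size = min_buffer_size
        self.cap = cap              # e.g. pool - min_buffer_size * max_conns
        self.multiple = multiple    # assumed multiplier
        self.size = min_buffer_size

    def on_rtt_sample(self, bytes_in_last_rtt: int) -> int:
        """Grow the buffer to `multiple` times the data seen in one RTT,
        never shrinking it and never exceeding the per-connection cap."""
        target = self.multiple * bytes_in_last_rtt
        self.size = min(max(self.size, target), self.cap)
        return self.size
```

An attacker who doubles the apparent data per RTT (for example, by sending beyond the advertised window or by echoing timestamps optimistically) doubles the target, but the cap bounds the memory that any single connection can obtain.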
As a result, the following limit should be enforced on the size of each receive buffer:

      recv_buffer_size <= recv_buffer_pool - (min_buffer_size * max_connections)

where

   recv_buffer_size:
      Maximum receive buffer size to be used for this connection

   recv_buffer_pool:
      Total amount of system memory meant for TCP receive buffers

   min_buffer_size:
      Minimum receive buffer size for each TCP connection

   max_connections:
      Maximum number of TCP connections this system is expected to handle at a time

max_connections may be an artificial limit enforced by the system administrator specifically on the number of TCP connections, or may be derived from some other system limit (e.g., the maximum number of file descriptors).

These limits prevent the automatic tuning algorithm from assigning all the available memory buffers to existing connections, which would otherwise prevent the establishment of new connections.

It is interesting to note that a TCP sender will always try to retransmit any data that have not been acknowledged by TCP's cumulative acknowledgement. Therefore, if memory exhaustion is imminent, a system should consider freeing those memory buffers used for TCP segments that were received out of order, particularly when a given connection has been keeping a large number of out-of-order segments in the receive buffer for a considerable period of time.

It is worth noting that TCP Selective Acknowledgements (SACK) are advisory, in the sense that a TCP that has SACKed (but not ACKed) a block of data is free to discard that block, and expect the TCP sender to retransmit it when the retransmission timer of the peer TCP expires.

8. TCP segment reassembly algorithm

8.1. Problems that arise from ambiguity in the reassembly process

A security consideration that should be made for the TCP segment reassembly algorithm is that of data stream consistency between the host performing the TCP segment reassembly, and a Network Intrusion Detection System (NIDS) being employed to monitor the host in question.

In the event a TCP segment was unnecessarily retransmitted, or there was packet duplication in any of the intervening networks, a TCP might get more than one copy of the same data. Also, as TCP segments can be re-packetized when they are retransmitted, a given TCP segment might partially overlap data already received in earlier segments. In all these cases, the question arises about which of the copies of the received data should be used when reassembling the data stream. In legitimate and normal circumstances, all copies would be identical, and the same data stream would be obtained regardless of which copy of the data was used. However, an attacker could maliciously send overlapping segments containing different data, with the intent of evading a NIDS, which might reassemble the received TCP segments differently than the monitored system. [Ptacek and Newsham, 1998] provides a detailed discussion of these issues.

As suggested in Section 3.9 of RFC 793 [Postel, 1981c], if a TCP segment arrives containing some data bytes that have already been received, the first copy of those data should be used for reassembling the application data stream. It should be noted that while convergence to this policy might prevent some cases of ambiguity in the reassembly process, there are a number of other techniques that an attacker could still exploit to evade a NIDS [CPNI, 2008].
These techniques can generally be defeated if the NIDS is placed in-line with the monitored system, thus allowing the NIDS to normalize the network traffic or apply some other policy that could ensure consistency between the result of the segment reassembly process obtained by the monitored host and that obtained by the NIDS.

[CERT, 2003] and [CORE, 2003] are advisories about a heap buffer overflow in a popular Network Intrusion Detection System resulting from incorrect sequence number calculations in its TCP stream-reassembly module.

9. TCP Congestion Control

TCP implements two algorithms, "slow start" and "congestion avoidance", for controlling the rate at which data is transmitted on a TCP connection [Allman et al, 1999]. These algorithms require the addition of two variables as part of TCP per-connection state: cwnd and ssthresh.

The congestion window (cwnd) is a sender-side limit on the amount of outstanding data that the sender can have at any time, while the receiver's advertised window (rwnd) is a receiver-side limit on the amount of outstanding data. The minimum of cwnd and rwnd governs data transmission.

Another state variable, the slow-start threshold (ssthresh), is used to determine whether it is the slow start or the congestion avoidance algorithm that should control data transmission. When cwnd < ssthresh, "slow start" governs data transmission, and the congestion window (cwnd) is exponentially increased. When cwnd > ssthresh, "congestion avoidance" governs data transmission, and the congestion window (cwnd) is only linearly increased.

As specified in RFC 2581 [Allman et al, 1999], when cwnd and ssthresh are equal the sender may use either slow start or congestion avoidance.

During slow start, TCP increments cwnd by at most SMSS bytes for each ACK received that acknowledges new data.
During congestion avoidance, cwnd is incremented by 1 full-sized segment per round-trip time (RTT), until congestion is detected.

Additionally, TCP uses two algorithms, Fast Retransmit and Fast Recovery, to mitigate the effects of packet loss. The "Fast Retransmit" algorithm infers packet loss when three Duplicate Acknowledgements (DupACKs) are received.

The value "three" is meant to allow for fast retransmission of "missing" data, while preventing network packet reordering from triggering loss recovery.

Once packet loss is detected by the receipt of three duplicate ACKs, the "Fast Recovery" algorithm governs the transfer of new data until a non-duplicate ACK is received that acknowledges the receipt of new data. The Fast Retransmit and Fast Recovery algorithms are usually implemented together, as follows (from RFC 2581):

o  When the third duplicate ACK is received, set ssthresh to no more than the value given in the equation:

      ssthresh = max (FlightSize / 2, 2*SMSS)

o  Retransmit the lost segment and set cwnd to ssthresh plus 3*SMSS. This artificially "inflates" the congestion window by the number of segments (three) that have left the network and which the receiver has buffered.

o  For each additional duplicate ACK received, increment cwnd by SMSS. This artificially inflates the congestion window in order to reflect the additional segment that has left the network.

o  Transmit a segment, if allowed by the new value of cwnd and the receiver's advertised window.

o  When the next ACK arrives that acknowledges new data, set cwnd to ssthresh (the value set in the first step). This is termed "deflating" the window.

9.1. Congestion control with misbehaving receivers

[Savage et al, 1999] describes a number of ways in which TCP's congestion control mechanisms can be exploited by a misbehaving TCP receiver to obtain more than its fair share of bandwidth. The following subsections provide a brief discussion of these vulnerabilities, along with the possible countermeasures.

9.1.1. ACK division

Given that TCP updates cwnd based on the number of ACKs it receives, rather than on the amount of data that each ACK is actually acknowledging, a malicious TCP receiver could cause the TCP sender to illegitimately increase its congestion window by acknowledging a data segment with a number of separate Acknowledgements, each covering a distinct piece of the received data segment.

See Figure 7, on page 64 of the UK CPNI document.

ACK division attack

[Savage et al, 1999] describes two possible countermeasures for this vulnerability. One of them is to increment cwnd not by a full SMSS, but proportionally to the amount of data being acknowledged by the received ACK, similarly to the policy described in RFC 3465 [Allman, 2003]. Another alternative is to increase cwnd by one SMSS only when a valid ACK covers the entire data segment sent.

9.1.2. DupACK forgery

The second vulnerability discussed in [Savage et al, 1999] allows an attacker to cause the TCP sender to illegitimately increase its congestion window by forging a number of duplicate Acknowledgements (DupACKs). Figure 8 shows a sample scenario. The first three DupACKs trigger the Fast Recovery mechanism, while the rest of them cause the congestion window at the TCP sender to be illegitimately inflated. Thus, the attacker is able to illegitimately cause the TCP sender to increase its data transmission rate.

See Figure 8, on page 65 of the UK CPNI document.
DupACK forgery attack

Fortunately, a number of sender-side heuristics can be implemented to mitigate this vulnerability. Firstly, the TCP sender could keep track of the number of outstanding segments (o_seg), and accept only up to (o_seg - 1) DupACKs. Secondly, a TCP sender might, for example, refuse to enter Fast Recovery multiple times in some period of time (e.g., one RTT).

[Savage et al, 1999] also describes a modification to TCP to implement a nonce protocol that would eliminate this vulnerability. However, this would require modification of all implementations, which makes this counter-measure hard to deploy.

9.1.3. Optimistic ACKing

Another alternative for an attacker to exploit TCP's congestion control mechanisms is to acknowledge data that has not yet been received, thus causing the congestion window at the TCP sender to be incremented faster than it should.

See Figure 9, on page 66 of the UK CPNI document.

Optimistic ACKing attack

[Savage et al, 1999] describes a number of mitigations for this vulnerability. Firstly, it describes a countermeasure based on the concept of "cumulative nonce", which would allow a receiver to prove that it has received all the segments it is acknowledging. However, this countermeasure requires the introduction of two new fields to the TCP header, thus requiring a modification to all the communicating TCPs, which makes this counter-measure hard to deploy. Secondly, it describes a possible way to encode the nonce in a TCP segment by carefully modifying its size. While this countermeasure could be easily deployed (as it is just a sender-side policy), we believe that middle-boxes such as protocol scrubbers might prevent this counter-measure from working as expected.
Finally, it suggests that a TCP sender might penalize a TCP receiver that acknowledges data not yet sent by resetting the corresponding connection. We recommend against implementing this policy, as it would provide an attack vector for a TCP-based connection-reset attack, similar to those described in Section 11.

[US-CERT, 2005a] is a vulnerability advisory about this issue.

9.2. Blind DupACK triggering attacks against TCP

While all of the attacks discussed in [Savage et al, 1999] have the goal of increasing the performance of the attacker's TCP connections, TCP congestion control mechanisms can be exploited with a variety of goals.

Firstly, if bursts of many duplicate ACKs are sent to the "sending TCP", the third duplicate ACK will cause the "lost" segment to be retransmitted, and each subsequent duplicate ACK will cause cwnd to be artificially inflated. Thus, the "sending TCP" might end up injecting more packets into the network than it really should, with the potential of causing network congestion. This is a potential consequence of the "Duplicate-ACK spoofing attack" described in [Savage et al, 1999].

Secondly, if bursts of three duplicate ACKs are sent to the TCP sender, the attacked system would infer packet loss, and ssthresh and cwnd would be reduced. As noted in RFC 2581 [Allman et al, 1999], causing two congestion control events back-to-back will often cut ssthresh and cwnd to their minimum value of 2*SMSS, with the connection immediately entering the slower-performing congestion avoidance phase. While it would not be attractive for an attacker to perform this attack against one of his own TCP connections, the attack might be attractive when the TCP connection to be attacked is established between two other parties.
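Both effects described above can be reproduced with a simplified model of the Fast Retransmit and Fast Recovery steps from RFC 2581 listed in Section 9. The model counts cwnd in bytes, does not model the actual (re)transmissions, keeps FlightSize constant, and assumes an SMSS of 1460 bytes; the optional limit_dupacks flag implements the (o_seg - 1) DupACK heuristic from Section 9.1.2. All names are hypothetical.

```python
from dataclasses import dataclass

SMSS = 1460  # assumed sender maximum segment size, in bytes


@dataclass
class RenoSender:
    """Simplified RFC 2581 Fast Retransmit/Fast Recovery (cwnd in bytes).

    Retransmissions are not modelled and FlightSize is held constant.
    """
    cwnd: int
    ssthresh: int
    flight_size: int
    dupacks: int = 0
    in_recovery: bool = False
    limit_dupacks: bool = False   # accept at most (o_seg - 1) DupACKs

    def on_dupack(self) -> None:
        o_seg = self.flight_size // SMSS      # outstanding segments
        if self.limit_dupacks and self.dupacks >= o_seg - 1:
            return                            # implausible DupACK: ignore
        self.dupacks += 1
        if self.dupacks == 3 and not self.in_recovery:
            self.ssthresh = max(self.flight_size // 2, 2 * SMSS)
            self.cwnd = self.ssthresh + 3 * SMSS   # retransmission omitted
            self.in_recovery = True
        elif self.in_recovery:
            self.cwnd += SMSS                 # inflate for each extra DupACK

    def on_new_ack(self) -> None:
        if self.in_recovery:
            self.cwnd = self.ssthresh         # "deflate" the window
        self.in_recovery = False
        self.dupacks = 0
```

With 20 segments in flight, a burst of 103 DupACKs inflates cwnd to 113*SMSS, whereas with limit_dupacks enabled only 19 DupACKs are honoured and cwnd peaks at 29*SMSS. In both cases, the first ACK for new data deflates cwnd to 10*SMSS, half of the original flight.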
It is usually assumed that in order for an off-path attacker to perform attacks against a third-party TCP connection, he should be able to guess a number of values, including a valid TCP Sequence Number and a valid TCP Acknowledgement Number. While this is true if the attacker tries to "inject" valid packets into the connection by himself, a feature of TCP can be exploited to fool one of the TCP endpoints into transmitting valid duplicate Acknowledgements on behalf of the attacker, hence relieving the attacker of the hard task of forging valid values for the Sequence Number and Acknowledgement Number TCP header fields.

Section 3.9 of RFC 793 [Postel, 1981c] describes the processing of incoming TCP segments as a function of the connection state and the contents of the various header fields of the received segment. For connections in the ESTABLISHED state, the first check that is performed on incoming segments is that they contain "in window" data. That is,

      RCV.NXT <= SEG.SEQ < RCV.NXT+RCV.WND, or

      RCV.NXT <= SEG.SEQ+SEG.LEN-1 < RCV.NXT+RCV.WND

If a segment does not pass this check, it is dropped, and an Acknowledgement is sent in response:

      <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK>

The goal of this behavior is that, in the event data segments are received by the TCP receiver, but all the corresponding Acknowledgements are lost, when the TCP sender retransmits the supposedly lost data, the TCP receiver will send an Acknowledgement reflecting all the data received so far. If "old" TCP segments were silently dropped, the scenario just described would lead to a "frozen" TCP connection, with the TCP sender retransmitting the data for which it has not yet received an Acknowledgement, and the TCP receiver silently ignoring these segments. Additionally, this behavior helps TCP to detect half-open connections.
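The acceptability test quoted above, and the ACK that RFC 793 sends in reply to an unacceptable segment, can be sketched as follows. The sketch follows the four cases of Section 3.9 of RFC 793 (including the rule that unacceptable segments with RST set are dropped without a reply) and uses modulo-2^32 sequence arithmetic; it models only the reply, and all function names are hypothetical.

```python
MOD = 2 ** 32  # TCP sequence-number space


def in_window(seq: int, rcv_nxt: int, rcv_wnd: int) -> bool:
    """RCV.NXT <= seq < RCV.NXT + RCV.WND, in modulo-2^32 arithmetic."""
    return (seq - rcv_nxt) % MOD < rcv_wnd


def acceptable(seg_seq: int, seg_len: int, rcv_nxt: int, rcv_wnd: int) -> bool:
    """Segment acceptability test (Section 3.9 of RFC 793)."""
    if seg_len == 0:
        if rcv_wnd == 0:
            return seg_seq == rcv_nxt
        return in_window(seg_seq, rcv_nxt, rcv_wnd)
    if rcv_wnd == 0:
        return False
    last = (seg_seq + seg_len - 1) % MOD
    return in_window(seg_seq, rcv_nxt, rcv_wnd) or in_window(last, rcv_nxt, rcv_wnd)


def reply_to_segment(seg_seq, seg_len, rst, snd_nxt, rcv_nxt, rcv_wnd):
    """Return the ACK sent for an unacceptable segment, else None."""
    if acceptable(seg_seq, seg_len, rcv_nxt, rcv_wnd):
        return None            # normal processing, not modelled here
    if rst:
        return None            # unacceptable RST: drop, no reply
    return {"SEQ": snd_nxt, "ACK": rcv_nxt, "CTL": "ACK"}
```

Because the reply depends only on SND.NXT and RCV.NXT, every out-of-window segment the attacker sends elicits an identical ACK, which (absent intervening data) the other endpoint sees as a duplicate Acknowledgement.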
This feature implies that, provided the four-tuple that identifies a given TCP connection is known or can be easily guessed, an attacker could send a TCP segment with an "out of window" Sequence Number to one of the endpoints of the TCP connection to cause it to send a valid ACK to the other endpoint of the connection. Figure 10 illustrates such a scenario.

See Figure 10, on page 68 of the UK CPNI document.

Blind Dup-ACK forgery attack

As discussed in [Watson, 2004] and RFC 4953 [Touch, 2007], there are a number of scenarios in which the four-tuple that identifies a TCP connection is known or can be easily guessed. In those scenarios, an attacker could perform any of the "blind" attacks described in the following subsections by exploiting the technique described above.

The following subsections describe blind DupACK-triggering attacks that aim at either degrading the performance of an arbitrary connection, or causing a TCP sender to illegitimately increase the rate at which it transmits data, potentially leading to network congestion.

9.2.1. Blind throughput-reduction attack

As discussed in Section 9, when three duplicate Acknowledgements are received, the congestion window is reduced to half the current amount of outstanding data (FlightSize). Additionally, the slow-start threshold (ssthresh) is reduced to the same value, causing the connection to enter the slower-performing congestion avoidance phase. If two congestion-control events occur back to back, ssthresh and cwnd will often be reduced to their minimum value of 2*SMSS.

An attacker could exploit the technique described in Section 9.2 to cause the throughput of the attacked TCP connection to be reduced, by eliciting three duplicate acknowledgements from the TCP receiver, which would cause the TCP sender to reduce its congestion window.
In principle, the attacker would need to send a burst of only three out-of-window segments. However, in case the TCP receiver implements an acknowledgement policy such as "ACK every other segment", four out-of-window segments might be needed. The first segment would cause the pending (delayed) Acknowledgement to be sent, and the next three segments would elicit the actual duplicate Acknowledgements.

Figure 11 shows a time-line graph of a sample scenario. The burst of DupACKs (in green) elicited by the burst of out-of-window segments (in red) sent by the attacker causes the TCP sender to retransmit the missing segment (in blue) and enter the loss recovery phase. Once a segment that acknowledges new data is received by the TCP sender, the loss recovery phase ends, and cwnd and ssthresh are set to half the number of segments that were outstanding when the loss recovery phase was entered.

See Figure 11, on page 69 of the UK CPNI document.

Blind throughput-reduction attack (time-line graph)

The graphic assumes that the TCP receiver sends an Acknowledgement for every other data segment it receives, and that the TCP sender implements Appropriate Byte Counting (specified in RFC 3465 [Allman, 2003]) on the received Acknowledgement segments. However, implementation of these policies is not required for the attack to succeed.

9.2.2. Blind flooding attack

As discussed in Section 9, when three duplicate Acknowledgements are received, the "lost" segment is retransmitted, and the congestion window is artificially inflated for each DupACK received, until the loss recovery phase ends.
By sending a long burst of out-of-window segments to the TCP receiver of the attacked connection, an attacker could elicit a long burst of valid duplicate acknowledgements that would illegitimately cause the TCP sender of the attacked TCP connection to increase its data transmission rate.

Figure 12 shows a time-line graph for this attack. The long burst of DupACKs (in green) elicited by the long burst of out-of-window segments (in red) sent by the attacker causes the TCP sender to enter the loss recovery phase and illegitimately inflate the congestion window, leading to an increase in the data transmission rate. Once a segment that acknowledges new data is received by the TCP sender, the loss recovery phase ends, and the data transmission rate is reduced.

See Figure 12, on page 70 of the UK CPNI document.

Blind flooding attack (time-line graph)

Figure 13 is a time-sequence graph produced from packet logs obtained from tests of the described attack in a real network. A burst of segments is sent upon receipt of the burst of Duplicate Acknowledgements illegitimately elicited by the attacker. Figure 14 is an averaged-throughput graphic for the same time frame, which clearly shows the effect of the attack in terms of throughput.

See Figure 13, on page 71 of the UK CPNI document.

Blind flooding attack (time sequence graph)

See Figure 14, on page 71 of the UK CPNI document.

Blind flooding attack (averaged throughput graph)

These graphics were produced with Shawn Ostermann's tcptrace tool [Ostermann, 2008]. An explanation of the format of the graphics can be found in tcptrace's manual (available at the project's web site: http://www.tcptrace.org).

9.2.3. Difficulty in performing the attacks

In order to exploit the technique described in Section 9.2 of this document, an attacker would need to know the four-tuple {IP Source Address, TCP Source Port, IP Destination Address, TCP Destination Port} that identifies the connection to be attacked. As discussed by [Watson, 2004] and RFC 4953 [Touch, 2007], there are a number of scenarios in which these values may be known or easily guessed.

It is interesting to note that the attacks described in Section 9.2 of this document will typically require a much smaller number of packets than other "blind" attacks against TCP, such as those described in [Watson, 2004] and RFC 4953 [Touch, 2007], as the technique discussed in Section 9.2 relieves the attacker from having to guess valid TCP Sequence Numbers and TCP Acknowledgement Numbers.

The attacks described in Section 9.2.1 and Section 9.2.2 of this document require the attacker to forge the source address of the packets he sends. Therefore, if ingress/egress filtering is performed by intermediate systems, the attacker's packets would not get to the intended recipient, and thus the attack would not succeed. However, we consider that ingress/egress filtering cannot be relied upon as the first line of defense against these attacks.

Finally, it is worth noting that in order to successfully perform the blind attacks discussed in Section 9.2.1 and Section 9.2.2 of this document, the burst of out-of-sequence segments sent by the attacker should not be intermixed with valid data segments sent by the TCP sender, or else the Acknowledgement Number of the illegitimately-elicited ACK segments would change, and the Acknowledgements would not be considered "Duplicate Acknowledgements" by the TCP sender.
Tests performed in real networks seem to suggest that this requirement is not hard to fulfill, though.

9.2.4. Modifications to TCP's loss recovery algorithms

There are a number of algorithms that augment TCP's loss recovery mechanism that have been suggested by TCP researchers and have been specified by the IETF in the RFC series. This section describes a number of these algorithms, and discusses how their implementation affects (or does not affect) the vulnerability of TCP to the attacks discussed in Section 9.2.1 and Section 9.2.2 of this document.

NewReno

RFC 3782 [Floyd et al, 2004] specifies the NewReno algorithm, which is meant to improve TCP's performance in the presence of multiple losses in a single window of data. The implication of this algorithm with respect to the attacks discussed in the previous sections is that whenever either of the attacks is performed against a connection with a NewReno TCP sender, a full window (or half a window) of data will be unnecessarily retransmitted. This is particularly relevant in the case of the blind flooding attack, as the attack would elicit even more packets from the TCP sender.

Whether a full window or just half a window of data is retransmitted depends on the Acknowledgement policy at the TCP receiver. If the TCP receiver sends an Acknowledgement (ACK) for every segment, a full window of data will be retransmitted. If the TCP receiver sends an Acknowledgement (ACK) for every other segment, then only half a window of data will be retransmitted.

Figure 15 is a time-sequence graph produced from packet logs obtained from tests performed in a real network. Once loss recovery is illegitimately triggered by the duplicate ACKs elicited by the attacker, an entire flight of data is unnecessarily retransmitted.
Figure 16 is an averaged-throughput graph for the same time frame, which shows an increase in the throughput of the connection resulting from the retransmission of segments governed by NewReno's loss recovery.

See Figure 15, in page 73 of the UK CPNI document.

Figure 15: NewReno loss recovery (time-sequence graph)

See Figure 16, in page 74 of the UK CPNI document.

Figure 16: NewReno loss recovery (averaged throughput graph)

Limited Transmit

RFC 3042 [Allman et al, 2001] proposes an enhancement to TCP to more effectively recover lost segments when a connection's congestion window is small, or when a large number of segments are lost in a single transmission window. The "Limited Transmit" algorithm calls for sending a new data segment in response to each of the first two Duplicate Acknowledgements that arrive at the TCP sender. This would provide two additional transmitted packets that may be useful for the attacker if the blind flooding attack described in Section 9.2.2 is performed.

SACK-based loss recovery

RFC 3517 [Blanton et al, 2003] specifies a conservative loss-recovery algorithm that is based on the use of the selective acknowledgement (SACK) TCP option. The algorithm uses DupACKs as an indication of congestion, as specified in RFC 2581 [Allman et al, 1999]. However, a difference between this algorithm and the basic algorithm described in RFC 2581 is that it clocks out segments based only on the SACK information included in the DupACKs. That is, during the loss recovery phase, segments will be injected into the network only if the SACK information included in the received DupACKs indicates that one or more segments have left the network. As a result, those systems that implement SACK-based loss recovery will not be vulnerable to the blind flooding attack described in Section 9.2.2.
However, as RFC 3517 does not actually require DupACKs to include new SACK information (corresponding to data that has not yet been acknowledged by TCP's cumulative Acknowledgement), systems that implement SACK-based loss recovery may still remain vulnerable to the blind throughput-reduction attack described in Section 9.2.1. SACK-based loss recovery implementations should be updated to implement the countermeasure ("Use of SACK information to validate DupACKs") described in Section 9.2.5.

9.2.5. Countermeasures

Validating TCP sequence numbers

As discussed in Section 9.2, TCP responds with an ACK when an out-of-window segment is received, to accommodate those scenarios in which the Acknowledgement segments that correspond to some received data are lost in the network, and to help discover half-open TCP connections.

However, it is possible to restrict the sequence numbers that are considered acceptable, and have TCP respond with ACKs only when it is strictly necessary.

The following check could be performed on the TCP sequence number of an incoming TCP segment:

RCV.NXT - MAX.RCV.WND <= SEG.SEQ <= RCV.NXT + RCV.WND

Equation 2: Validating TCP Sequence Numbers

where MAX.RCV.WND is the largest TCP window that has so far been advertised to the remote endpoint.

If a segment passes this check, the processing rules specified in RFC 793 [Postel, 1981c] should be applied. Otherwise, TCP should send an ACK (as specified by the processing rules in RFC 793 [Postel, 1981c]), applying rate-limiting to the Acknowledgement segments sent in response to out-of-window segments.

Discussion

A feature of TCP is that, in some scenarios, it can detect half-open connections.
If an implementation chose to silently drop those TCP segments that do not pass the check enforced by Equation 2, it could prevent TCP from detecting half-open connections. Figure 17 shows a scenario in which, provided that "TCP B" behaves as specified in RFC 793, a half-open connection would be discovered and aborted.

An established connection is said to be "half open" if one of the TCPs has closed or aborted the connection at its end without the knowledge of the other, or if the two ends of the connection have become desynchronized owing to a crash that resulted in loss of memory.

See Figure 17, in page 76 of the UK CPNI document.

Figure 17: Half-Open Connection Discovery

In the scenario illustrated by Figure 17, TCP A crashes, losing the connection-state information of the TCP connection with TCP B. In line 3, TCP A tries to establish a new connection with TCP B, using the same four-tuple {IP Source Address, TCP source port, IP Destination Address, TCP destination port}. In line 4, as the SYN segment is out of window, TCP B responds with an ACK. This ACK elicits an RST segment from TCP A, which causes the half-open connection at TCP B to be aborted.

If the SYN segment had been "in window", TCP B would have sent an RST segment instead, which would have closed the half-open connection. Ongoing work at the TCPM WG of the IETF proposes to change this behavior, and make TCP respond to a SYN segment received for any of the synchronized states with an ACK segment, to prevent in-window SYN segments from being used to perform connection-reset attacks [Ramaiah et al, 2008].

However, if the out-of-window segment were silently dropped, the scenario in Figure 17 would change into that in Figure 18.

See Figure 18, in page 76 of the UK CPNI document.
Figure 18: Half-Open Connection Discovery with the proposed countermeasure

In line 3, the SYN segment sent by TCP A is silently dropped by TCP B because it does not pass the check enforced by Equation 2 (i.e., it contains an out-of-window sequence number). As a result, some time later (an RTO), TCP A retransmits its SYN segment. Even after TCP A times out, the half-open connection at TCP B will remain in the same state.

Thus, a conservative reaction to those segments that do not pass the check enforced by Equation 2 would be to respond with an Acknowledgement segment (as specified by RFC 793), applying rate-limiting to those Acknowledgement segments sent in response to segments that do not pass the check enforced by that equation. An implementation might choose to enforce a rate-limit of, e.g., one ACK per five seconds, as a single ACK segment is needed for the Half-Open Connection Discovery mechanism to work.

As the only reason to respond with an ACK to those segments that do not pass the check enforced by Equation 2 is to allow TCP to discover half-open connections, an aggressive rate-limit can be enforced. As long as the rate-limit prevents out-of-window segments from eliciting three Acknowledgement segments in a Round-Trip Time (RTT), an attacker would not be able to trigger TCP's loss recovery, and thus would not be able to perform the attacks described in the previous sections.

It is interesting to note that RFC 793 [Postel, 1981c] itself states that half-open connections are expected to be unusual. Additionally, given that in many scenarios it may be unlikely for a TCP connection request to be issued with the same four-tuple as that of the half-open connection, a complete solution for the discovery of half-open connections cannot rely on the mechanism illustrated by Figure 17, either.
Therefore, some implementations might choose to sacrifice TCP's ability to detect half-open connections, and react more aggressively to those segments that do not pass the check enforced by Equation 2 by silently dropping them.

This validation check can also help to avoid ACK wars in some scenarios that may arise from the use of transparent proxies. In those scenarios, when the transparent proxy fails to wire (i.e., is disabled), the sequence numbers of the two endpoints of the TCP connection become desynchronized, and both TCPs begin to send duplicate Acknowledgements to each other, with the intention of resynchronizing them. As the sequence numbers never get resynchronized, the ACK war can only be stopped by an external agent.

Limiting the number of duplicate acknowledgements

Given that duplicate acknowledgements should be elicited by out-of-order segments, a TCP sender could limit the number of duplicate acknowledgements it will honour to:

Max_DupACKs = (FlightSize / SMSS) - 1

where FlightSize and SMSS are the values defined in RFC 2581 [Allman et al, 1999]. (In a flight of FlightSize/SMSS segments, at most all the segments but the lost one can arrive out of order and elicit a DupACK.) When more than Max_DupACKs duplicate acknowledgements are received, the excess DupACKs should be silently dropped.

Use of SACK information to validate DupACKs

SACK, specified in RFC 2018 [Mathis et al, 1996], provides a mechanism for TCP to be able to acknowledge the receipt of out-of-order TCP segments. For connections that have agreed to use SACK, each legitimate DupACK will contain new SACK information that reflects the data bytes contained in the out-of-order data segment that elicited the DupACK.

RFC 3517 [Blanton et al, 2003] specifies a SACK-based loss recovery algorithm for TCP. However, it does not recommend that TCP implementations validate DupACKs by requiring that they contain new SACK information.
Results obtained from auditing a number of TCP implementations seem to indicate that most TCP implementations do not enforce this validation check on incoming DupACKs, either.

In the case of TCP connections that have agreed to use SACK, a validation check should be performed on incoming ACK segments to completely eliminate the attacks described in Section 9.2.1 and Section 9.2.2 of this document: "Duplicate ACKs should contain new SACK information. The SACK information should refer to data that has already been sent, but that has not yet been acknowledged by TCP's cumulative Acknowledgement".

Those ACK segments that do not comply with this validation check should not be considered "duplicate ACKs", and thus should not trigger the loss-recovery phase.

If at least one segment in a window of data has been lost, the successive segments will elicit the generation of Duplicate ACKs containing new SACK information. This SACK information will indicate the receipt of these successive segments by the TCP receiver.

In the case of pure ACKs illegitimately elicited by out-of-window segments, however, the ACKs will not contain any SACK information.

If DSACK (specified in RFC 2883 [Floyd et al, 2000]) were implemented by the TCP receiver, then the illegitimately elicited DupACKs might contain out-of-window SACK information if the sequence number of the forged TCP segment (SEG.SEQ) is lower than the next expected sequence number (RCV.NXT) at the TCP receiver. Such segments should be considered to indicate the receipt of duplicate data, rather than an indication of lost data, and therefore should not trigger loss recovery.
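The receiver-side check of Equation 2 and the sender-side DupACK checks discussed above can be illustrated with a minimal Python sketch. It assumes a simplified model: sequence numbers are plain integers compared modulo 2^32, SACK blocks are (left edge, right edge) pairs, and "new SACK information" means a block that has not been reported before (a real implementation would track SACKed byte ranges in a scoreboard). All function and class names are hypothetical, and the five-second ACK interval is merely the example rate mentioned earlier in this section.

```python
MOD = 2**32  # TCP sequence numbers are 32-bit and wrap around


def seq_diff(a, b):
    """Signed distance from b to a in 32-bit sequence-number space."""
    d = (a - b) % MOD
    return d - MOD if d >= 2**31 else d


def seq_acceptable(seg_seq, rcv_nxt, rcv_wnd, max_rcv_wnd):
    """Equation 2: RCV.NXT - MAX.RCV.WND <= SEG.SEQ <= RCV.NXT + RCV.WND."""
    return -max_rcv_wnd <= seq_diff(seg_seq, rcv_nxt) <= rcv_wnd


class OutOfWindowAckLimiter:
    """Rate-limits ACKs sent in response to segments failing Equation 2."""

    def __init__(self, interval=5.0):  # e.g., one ACK per five seconds
        self.interval = interval
        self.last = None

    def allow(self, now):
        """Return True if an ACK may be sent at time 'now' (in seconds)."""
        if self.last is None or now - self.last >= self.interval:
            self.last = now
            return True
        return False


def max_dupacks(flight_size, smss):
    """Max_DupACKs = (FlightSize / SMSS) - 1, using integer division."""
    return max(flight_size // smss - 1, 0)


def block_in_flight(block, snd_una, snd_nxt):
    """True if a SACK block covers data that was sent (not beyond SND.NXT)
    but is not yet cumulatively acknowledged (above SND.UNA). DSACK blocks
    reporting data below SND.UNA therefore fail this test."""
    left, right = block
    return (seq_diff(left, snd_una) > 0
            and seq_diff(right, left) > 0
            and seq_diff(right, snd_nxt) <= 0)


class DupAckValidator:
    """Sender-side validation of incoming duplicate ACKs (SACK connections)."""

    def __init__(self, smss):
        self.smss = smss
        self.reported = set()  # SACK blocks already seen
        self.honoured = 0      # DupACKs honoured for the current SND.UNA

    def new_cumulative_ack(self):
        """Reset the DupACK count when SND.UNA advances."""
        self.honoured = 0

    def on_dupack(self, sack_blocks, snd_una, snd_nxt, flight_size):
        """Return True if this ACK should count as a duplicate ACK."""
        new = [b for b in sack_blocks
               if b not in self.reported
               and block_in_flight(b, snd_una, snd_nxt)]
        if not new:
            return False  # no new, valid SACK information
        if self.honoured >= max_dupacks(flight_size, self.smss):
            return False  # beyond Max_DupACKs: silently ignored
        self.reported.update(new)
        self.honoured += 1
        return True
```

With this model, an ACK elicited by a forged out-of-window segment carries no SACK blocks (or only repeats blocks already reported), so on_dupack() returns False and loss recovery is not triggered, while the rate limiter keeps the ACKs needed for half-open connection discovery well below three per RTT.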
TCP port number randomization

As the attacker needs to know the TCP port numbers in use by the connection to be attacked in order to perform the blind attacks described in Section 9.2.1 and Section 9.2.2, obfuscating the TCP source port used for outgoing TCP connections will increase the number of packets required to successfully perform these attacks. Section 3.1 of this document discusses the use of port randomization.

It must be noted that, given that these blind DupACK-triggering attacks do not require the attacker to forge valid TCP Sequence Numbers and TCP Acknowledgement Numbers, port randomization should not be relied upon as a first line of defense.

Ingress and Egress filtering

Ingress and Egress filtering reduce the number of systems in the global Internet that can perform attacks that rely on forged source IP addresses. While protection from the blind attacks discussed in Section 9.2 should not rely only on Ingress and Egress filtering, their deployment is recommended to help prevent all attacks that rely on forged IP addresses. RFC 3704 [Baker and Savola, 2004], RFC 2827 [Ferguson and Senie, 2000], and [NISCC, 2006] provide advice on Ingress and Egress filtering.

Generalized TTL Security Mechanism (GTSM)

RFC 5082 [Gill et al, 2007] proposes a check on the TTL field of the IP packets that correspond to a given TCP connection to reduce the number of systems that could successfully attack the protected TCP connection. It provides the same level of protection against the attacks discussed in this document as against the attacks described in [Watson, 2004] and RFC 4953 [Touch, 2007]. While implementation of this mechanism may be useful in some scenarios, it should be clear that the countermeasures discussed in the previous sections provide a more effective and simpler solution than that provided by the GTSM.

9.3. TCP Explicit Congestion Notification (ECN)

ECN (Explicit Congestion Notification) provides a mechanism for intermediate systems to signal congestion to the communicating endpoints that in some scenarios can be used as an alternative to dropping packets.

RFC 3168 [Ramakrishnan et al, 2001] contains a detailed discussion of the possible ways and scenarios in which ECN could be exploited by an attacker.

RFC 3540 [Spring et al, 2003] specifies an improvement to ECN based on nonces, which protects against accidental or malicious concealment of marked packets from the TCP sender. The specified mechanism defines an "NS" ("Nonce Sum") field in the TCP header that makes use of one bit from the Reserved field, and requires a modification in both endpoints of a TCP connection to process this new field. This mechanism is still in "Experimental" status, and since it might suffer from the behavior of some middle-boxes such as firewalls or packet scrubbers, we defer a recommendation of this mechanism until more experience is gained.

There is also ongoing work in the research community and the IETF to define alternate semantics for the ECN field of the IP header (e.g., see [PCNWG, 2009]).

The following subsections try to summarize the security implications of ECN.

9.3.1. Possible attacks by a compromised router

Firstly, a router controlled by a malicious user could erase the CE codepoint (by replacing it with the ECT(0), ECT(1), or non-ECT codepoints), effectively eliminating the congestion indication. As a result, the corresponding TCP sender would not reduce its data transmission rate, possibly leading to network congestion.
This could also lead to unfairness, as this flow could experience better performance than other flows for which the congestion indication is not erased (and whose transmission rate is thus reduced).

Secondly, a router controlled by a malicious user could illegitimately set the CE codepoint, falsely indicating congestion, to cause the TCP sender to reduce its data transmission rate. However, this particular attack is no worse than the malicious router simply dropping the packets rather than setting their CE codepoint.

Thirdly, a malicious router could turn off the ECT codepoint of a packet, thus disabling ECN support. As a result, if the packet later arrives at a router that is experiencing congestion, it may be dropped rather than marked. As with the previous scenario, though, this is no worse than the malicious router simply dropping the corresponding packet.

It should be noted that a compromised on-path IP router could engage in a much broader range of attacks, with broader impacts, and at much lower attacker cost than the ones described here. Such a compromised router is extremely unlikely to engage in the attack vectors discussed in this section, given the existence of more effective attack vectors that have lower attacker cost.

9.3.2. Possible attacks by a malicious TCP endpoint

If a packet with the ECT codepoint set arrives at an ECN-capable router that is experiencing moderate congestion, the router may decide to set its CE codepoint instead of dropping it. If either of the TCP endpoints does not honour the congestion indication provided by an ECN-capable router, this would result in unfairness, as other (legitimate) ECN-capable flows would still reduce their sending rate in response to the ECN marking of packets.
Furthermore, under moderate congestion, non-ECN-capable flows would be subject to packet drops by the same router. As a result, the flow with a malicious TCP endpoint would obtain better service than the legitimate flows.

As noted in RFC 3168 [Ramakrishnan et al, 2001], a TCP endpoint falsely indicating ECN capability could lead to unfairness, allowing the misbehaving flow to get more than its fair share of the bandwidth. This could be the result of misbehavior by either of the TCP endpoints. For example, the sending TCP could indicate ECN capability, but then send a CWR in response to an ECE without actually reducing its congestion window. Alternatively (or in addition), the receiving TCP could simply ignore those packets with the CE codepoint set, thus preventing the sending TCP from receiving the congestion indication.

In the case of the sending TCP ignoring the ECN congestion indication, this would be no worse than the sending TCP ignoring the congestion indication provided by a lost segment. However, the case of a TCP receiver ignoring the CE codepoint allows the TCP receiver to get more than its fair share of bandwidth in a way that was previously unavailable. If congestion were kept "moderate", then the malicious TCP receiver could maintain the unfairness, as the router experiencing congestion would mark the offending packets of the misbehaving flow rather than dropping them. At the same time, legitimate ECN-capable flows would respond to the congestion indication provided by the CE codepoint, while legitimate non-ECN-capable flows would be subject to packet drops.
However, if congestion became sufficiently heavy, the router experiencing congestion would switch from marking packets to dropping packets, and at that point the attack vector provided by ECN could no longer be exploited (until congestion returns to a moderate state).

RFC 3168 [Ramakrishnan et al, 2001] describes the use of "penalty boxes" which would act on flows that do not respond appropriately to congestion indications. Section 10 of RFC 3168 suggests that a first action taken at a penalty box for an ECN-capable flow would be to switch to dropping packets (instead of marking them), and, if the flow does not respond appropriately to the congestion indication, the penalty box could reset the misbehaving connection. Here we discourage implementation of such a policy, as it would create a vector for connection-reset attacks. For example, an attacker could forge TCP segments with the same four-tuple as the targeted connection and cause them to transit the penalty box. The penalty box would first switch from marking to dropping packets. However, the attacker would continue sending forged segments at a steady rate. As a result, if the penalty box implemented such a severe policy of resetting connections for flows that still do not respond to end-to-end congestion control after switching from marking to dropping, the attacked connection would be reset.

10. TCP API

Section 3.8 of RFC 793 [Postel, 1981c] describes the minimum set of TCP User Commands required of all TCP implementations. Most operating systems provide an Application Programming Interface (API) that allows applications to make use of the services provided by TCP. One of the most popular APIs is the Sockets API, originally introduced in the BSD networking package [McKusick et al, 1996].

10.1. Passive opens and binding sockets

RFC 793 specifies the syntax of the "OPEN" command, which can be used to perform both passive and active opens. The syntax of this command is as follows:

OPEN (local port, foreign socket, active/passive [, timeout] [, precedence] [, security/compartment] [, options]) -> local connection name

When this command is used to perform a passive open (i.e., the active/passive flag is set to passive), the foreign socket parameter may be either fully specified (to wait for a particular connection) or unspecified (to wait for any call).

As discussed in Section 2.7 of RFC 793 [Postel, 1981c], if there are several passive OPENs with the same local socket (recorded in the corresponding TCB), an incoming connection will be matched to the TCB with the more specific foreign socket. This means that when the foreign socket of a passive OPEN matches that of the incoming connection request, that passive OPEN takes precedence over those passive OPENs with an unspecified foreign socket.

Popular APIs such as the Sockets API let the user specify the local socket as a fully specified {local IP address, local TCP port} pair, or as just the local TCP port (leaving the local IP address unspecified). In the former case, only those connection requests sent to {local IP address, local port} will be accepted. In the latter case, connection requests sent to any of the system's IP addresses will be accepted. In a similar fashion to the generic API described in Section 2.7 of RFC 793, if there is a pending passive OPEN with a fully specified local socket that matches that for which a connection establishment request has been received, that local socket will take precedence over those which have left the local IP address unspecified.
The implication of this is that an attacker could "steal" incoming connection requests meant for a local application by performing a passive OPEN that is more specific than that performed by the legitimate application.

In order to eliminate this vulnerability, when there is already a pending passive OPEN for some local port number, only processes belonging to the same user should be able to "reuse" the local port for another passive OPEN. Additionally, reuse of a local port could default to "off", and be enabled only by an explicit command (e.g., the setsockopt() function of the Sockets API).

10.2. Active opens and binding sockets

As discussed in Section 10.1, the "OPEN" command specified in Section 3.8 of RFC 793 [Postel, 1981c] can be used to perform active opens. In the case of active opens, the parameter "local port" will contain a so-called "ephemeral port". While the only requirement for such an ephemeral port is that the resulting connection-id be unique, port numbers that are currently in use by a TCP in the LISTEN state should not be allowed for use as ephemeral ports. If this rule is not complied with, an attacker could potentially "steal" an incoming connection to a local server application by issuing a connection request to the victim client (using the server application's port number as its ephemeral port) at roughly the same time the client tries to connect to the victim server application. If the SYN segment corresponding to the attacker's connection request and the SYN segment corresponding to the victim client "cross each other in the network", and provided the attacker is able to know or guess the ephemeral port used by the client, a TCP simultaneous open scenario would take place, and the incoming connection request sent by the client would be matched with the attacker's socket rather than with the victim server application's socket.
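The passive-OPEN precedence and the port-reuse restriction recommended in Section 10.1 can be sketched in Python as follows. This is a hypothetical, simplified model rather than the Sockets API itself: each passive OPEN is reduced to its owner's user ID, its local port, an optional local IP address (None meaning "unspecified"), and an explicit reuse flag standing in for an option set with setsockopt(). Requiring every existing listener on the port to have opted in as well is an assumption made for the sketch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PassiveOpen:
    owner_uid: int            # user that performed the passive OPEN
    port: int                 # local TCP port
    local_ip: Optional[str]   # None: local IP address left unspecified
    reuse: bool = False       # port reuse is "off" unless explicitly enabled


def match_listener(listeners: List[PassiveOpen], dst_ip: str, dst_port: int):
    """Match an incoming connection request to a passive OPEN. A fully
    specified local socket takes precedence over one that left the local
    IP address unspecified."""
    for lst in listeners:
        if lst.port == dst_port and lst.local_ip == dst_ip:
            return lst
    for lst in listeners:
        if lst.port == dst_port and lst.local_ip is None:
            return lst
    return None


def may_passive_open(listeners: List[PassiveOpen], new: PassiveOpen) -> bool:
    """Allow a passive OPEN on a port that already has one only if reuse
    was explicitly enabled and every existing listener on that port belongs
    to the same user (and also enabled reuse)."""
    same_port = [lst for lst in listeners if lst.port == new.port]
    if not same_port:
        return True
    return new.reuse and all(
        lst.reuse and lst.owner_uid == new.owner_uid for lst in same_port)
```

Without the may_passive_open() check, a listener owned by another user on the more specific socket {192.0.2.1, 80} would be returned by match_listener() for every request sent to 192.0.2.1, which is precisely the "stealing" scenario described in Section 10.1.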
As already noted, in order for this attack to succeed, the attacker should be able to guess or know (in advance) the ephemeral port selected by the victim client, and be able to know the right moment to issue a connection request to the victim client. While in many scenarios this may prove to be a difficult task, some factors such as an inadequate ephemeral port selection policy at the victim client could make this attack feasible.

It should be noted that most applications based on popular implementations of TCP APIs (such as the Sockets API) perform "passive opens" in three steps. Firstly, the application obtains a file descriptor to be used for inter-process communication (e.g., by issuing a socket() call). Secondly, the application binds the file descriptor to a local TCP port number (e.g., by issuing a bind() call), thus creating a TCP in the fictional CLOSED state. Thirdly, the aforementioned TCP is put in the LISTEN state (e.g., by issuing a listen() call). As a result, with such an implementation of the TCP API, even if port numbers in use for TCPs in the LISTEN state were not allowed for use as ephemeral ports, there is a window of time between the second and the third steps in which an attacker could be allowed to select a port number that would later be used for listening to incoming connections. Therefore, these implementations of the TCP API should enforce a stricter requirement for the allocation of port numbers: port numbers that are in use by a TCP in the LISTEN or CLOSED states should not be allowed for allocation as ephemeral ports.

An implementation might choose to relax the aforementioned restriction when the process or system user requesting allocation of such a port number is the same as the process or system user controlling the TCP in the CLOSED or LISTEN states with the same port number.
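The stricter allocation rule above, including the optional relaxation for the same user, can be sketched in Python as follows. This is a minimal, hypothetical model: each TCB is reduced to its local port, its state, and the owning user; the ephemeral range 49152-65535 (the IANA dynamic/private range) is an assumption; and a plain random choice stands in for the port-randomization algorithms discussed in Section 3.1. The uniqueness check on the full connection-id is deliberately omitted.

```python
import random
from dataclasses import dataclass

EPHEMERAL_PORTS = range(49152, 65536)  # assumed ephemeral port range


@dataclass
class Tcb:
    local_port: int
    state: str       # e.g. "CLOSED" (bound but not listening), "LISTEN"
    owner_uid: int


def usable_as_ephemeral(tcbs, port, uid, relax_for_same_user=True):
    """Reject ports in use by a TCP in the LISTEN or CLOSED state, unless
    (optionally) that TCP belongs to the requesting user."""
    for tcb in tcbs:
        if tcb.local_port == port and tcb.state in ("LISTEN", "CLOSED"):
            if not (relax_for_same_user and tcb.owner_uid == uid):
                return False
    return True


def allocate_ephemeral(tcbs, uid, rng=random):
    """Pick a random ephemeral port that satisfies the rule above."""
    candidates = list(EPHEMERAL_PORTS)
    rng.shuffle(candidates)
    for port in candidates:
        if usable_as_ephemeral(tcbs, port, uid):
            return port
    return None  # no suitable port available
```

Checking the CLOSED state as well as LISTEN is what closes the window between the bind() and listen() calls described above.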
11. Blind in-window attacks

In the last few years, awareness has been raised about a number of "blind" attacks that can be performed against TCP by forging TCP segments that fall within the receive window [NISCC, 2004] [Watson, 2004].

The term "blind" refers to the fact that the attacker does not have access to the packets that belong to the attacked connection.

The effects of these attacks range from connection resets to data injection. While these attacks were known in the research community, they were generally considered infeasible. However, increases in bandwidth availability and the use of larger TCP windows raised concerns in the community. The following subsections discuss a number of forgery attacks against TCP, along with the possible countermeasures to mitigate their impact.

11.1. Blind TCP-based connection-reset attacks

Blind connection-reset attacks have the goal of causing a TCP connection maintained between two TCP endpoints to be aborted. The level of damage that the attack may cause usually depends on the application running on top of TCP, with the most vulnerable applications being those that rely on long-lived TCP connections.

An interesting case of such applications is BGP [Rekhter et al, 2006], in which a connection reset usually results in the corresponding entries of the routing table being flushed.

There are a variety of vectors for performing TCP-based connection-reset attacks against TCP. [Watson, 2004] and [NISCC, 2004] raised awareness about connection-reset attacks that exploit the RST flag of TCP segments. [Ramaiah et al, 2008] noted that carefully crafted SYN segments could also be used to perform connection-reset attacks.
This document describes two additional, previously undocumented vectors for performing connection-reset attacks: the Precedence field of IP packets that encapsulate TCP segments, and illegal TCP options.

11.1.1. RST flag

The RST flag signals a TCP peer that the connection should be aborted. In contrast with the FIN handshake (which gracefully terminates a TCP connection), an RST segment causes the connection to be abnormally closed.

As stated in Section 3.4 of RFC 793 [Postel, 1981c], all reset segments are validated by checking their Sequence Numbers, with the Sequence Number considered valid if it is within the receive window. In the SYN-SENT state, however, an RST is valid if the Acknowledgement Number acknowledges the SYN segment that supposedly elicited the reset.

[Ramaiah et al, 2008] proposes a modification to TCP's transition diagram to address this attack vector. The countermeasure is a combination of enforcing a stricter validation check on the sequence number of reset segments, and the addition of a "challenge" mechanism. With the implementation of the proposed mechanism, TCP would behave as follows:

If the Sequence Number of an RST segment is outside the receive window, the segment is silently dropped (as stated by RFC 793). That is, a reset segment is discarded unless it passes the following check:

RCV.NXT <= Sequence Number < RCV.NXT+RCV.WND

If the sequence number falls exactly on the left edge of the receive window, the reset is honoured.
That is, the connection is reset if the following condition is true:

Sequence Number == RCV.NXT

If an RST segment passes the first check (i.e., it is within the receive window) but does not pass the second check (i.e., it does not fall exactly on the left edge of the receive window), an Acknowledgement segment ("challenge ACK") is sent in response:

<SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK>

This Acknowledgement segment is referred to as a "challenge ACK" as, in the event the RST segment that elicited it had been legitimate (but silently dropped as a result of enforcing the above checks), the challenge ACK would elicit a new reset segment that would fall exactly on the left edge of the window and would thus pass all the above checks, finally resetting the connection.

We recommend the implementation of this countermeasure. However, we are aware of patent claims on this countermeasure, and suggest that vendors research the consequences of the possible patents that may apply.

[US-CERT, 2003a] is an advisory about a firewall system that was found to be particularly vulnerable to reset attacks because it did not validate the TCP Sequence Number of RST segments. Clearly, all TCPs (including those in middle-boxes) should validate RST segments as discussed in this section.

11.1.2. SYN flag

Section 3.9 (page 71) of RFC 793 [Postel, 1981c] states that if a SYN segment is received with a valid (i.e., "in window") Sequence Number, an RST segment should be sent in response, and the connection should be aborted.

The IETF has been working on a document, "Improving TCP's Resistance to Blind In-Window Attacks" [Ramaiah et al, 2008], which addresses, among others, this variant of TCP-based connection-reset attack. This section describes the countermeasure proposed by the IETF, a problem that may arise from the implementation of that solution, and a workaround to it.
In order to mitigate this attack vector, [Ramaiah et al, 2008] proposes to change TCP's reaction to SYN segments as follows: when a SYN segment is received for a connection in any of the synchronized states, an Acknowledgement (ACK) segment is sent in response.

As discussed in [Ramaiah et al, 2008], there is a corner case that would not be properly handled by this mechanism. Suppose a host (TCP A) establishes a TCP connection with a remote peer (TCP B), then crashes, reboots, and tries to initiate a new incarnation of the same connection (i.e., a connection with the same four-tuple as the previous connection) using an Initial Sequence Number equal to the RCV.NXT value at the remote peer (TCP B). In that case, the ACK segment sent by TCP B in response to the SYN segment would contain an Acknowledgement Number that would be considered valid by TCP A, and thus TCP A would not send an RST segment in response to that ACK. As this ACK would not have the SYN bit set, TCP A (being in the SYN-SENT state) would silently drop it (as stated on page 68 of RFC 793). After a Retransmission Timeout (RTO), TCP A would retransmit its SYN segment, which would lead to the same sequence of events as before. Eventually, TCP A would time out, and the connection would be aborted. This is a corner case in which the introduced change would lead to undesirable behavior. However, we consider this scenario to be extremely unlikely and, in the event it ever took place, the connection would nevertheless be aborted after retrying for a period of USER TIMEOUT seconds.

However, when this change is implemented exactly as described in [Ramaiah et al, 2008], it introduces the potential for interoperability problems, as it disables a heuristic widely incorporated in many TCP implementations.
In a number of scenarios, a socket pair may need to be reused while the corresponding four-tuple is still in the TIME-WAIT state at a remote TCP peer. For example, a client accessing some service on a host may try to create a new incarnation of a previous connection while the corresponding four-tuple is still in the TIME-WAIT state at the remote TCP peer (the server). This may happen if ephemeral port numbers are being reused too quickly, either because of a poor ephemeral-port selection policy, or simply because of a high connection rate to the corresponding service. In such scenarios, the establishment of new connections that reuse a four-tuple that is in the TIME-WAIT state would fail. In order to avoid this problem, RFC 1122 [Braden, 1989] states (in Section 4.2.2.13) that when a connection request is received with a four-tuple that is in the TIME-WAIT state, the connection request could be accepted if the sequence number of the incoming SYN segment is greater than the last sequence number seen on the previous incarnation of the connection (for that direction of the data transfer).

This requirement aims at preventing the sequence number spaces of the new and old incarnations of the connection from overlapping, thus preventing old segments from the previous incarnation of the connection from being accepted as valid by the new connection.

The requirement in [Ramaiah et al, 2008] to merely acknowledge SYN segments received for connections in any of the synchronized states forbids the implementation of the heuristic described above. As a result, we argue that the processing of SYN segments proposed in [Ramaiah et al, 2008] should apply only to connections in synchronized states other than the TIME-WAIT state.
The following paragraphs summarize the processing of SYN segments in the synchronized states, such that connection-reset attacks are mitigated while interoperability is not affected. Additionally, the Timestamps option of the incoming SYN segment (if present) is included in the heuristics performed for allowing a high connection-establishment rate, thus improving the robustness of TCP.

Processing of SYN segments received for connections in the synchronized states should occur as follows:

   o  If a SYN segment is received for a connection in any synchronized state other than TIME-WAIT, respond with an ACK, applying rate-throttling.

   o  If the corresponding connection is in the TIME-WAIT state, then,

      *  If the previous incarnation of the connection used timestamps, then,

         +  If TCP timestamps would be enabled for the new incarnation of the connection, and the timestamp contained in the incoming SYN segment is greater than the last timestamp seen on the previous incarnation of the connection (for that direction of the data transfer), honour the connection request (creating a connection in the SYN-RECEIVED state).

         +  If TCP timestamps would be enabled for the new incarnation of the connection, the timestamp contained in the incoming SYN segment is equal to the last timestamp seen on the previous incarnation of the connection (for that direction of the data transfer), and the Sequence Number of the incoming SYN segment is larger than the last sequence number seen on the previous incarnation of the connection (for that direction of the data transfer), then honour the connection request (creating a connection in the SYN-RECEIVED state).

         +  If TCP timestamps would not be enabled for the new incarnation of the connection, but the Sequence Number of the incoming SYN segment is larger than the last sequence number seen on the previous incarnation of the connection (for the same direction of the data transfer), honour the connection request (creating a connection in the SYN-RECEIVED state).

         +  Otherwise, silently drop the incoming SYN segment, thus leaving the previous incarnation of the connection in the TIME-WAIT state.

      *  If the previous incarnation of the connection did not use timestamps, then,

         +  If TCP timestamps would be enabled for the new incarnation of the connection, honour the incoming connection request.

         +  If TCP timestamps would not be enabled for the new incarnation of the connection, but the Sequence Number of the incoming SYN segment is larger than the last sequence number seen on the previous incarnation of the connection (for the same direction of the data transfer), then honour the incoming connection request (even if the sequence number of the incoming SYN segment falls within the receive window of the previous incarnation of the connection).

         +  Otherwise, silently drop the incoming SYN segment, thus leaving the previous incarnation of the connection in the TIME-WAIT state.

In the above explanation, the phrase "TCP timestamps would be enabled for the new incarnation of the connection" means that the incoming SYN segment contains a TCP Timestamps option (i.e., the client has enabled TCP timestamps), and that the SYN/ACK segment that would be sent in response to it would also contain a Timestamps option (i.e., the server has enabled TCP timestamps). In such a scenario, TCP timestamps would be enabled for the new incarnation of the connection.
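The decision procedure above can be condensed into a single function. The following is a minimal Python sketch, not a reference implementation: the names `process_syn`, `PrevIncarnation`, `IncomingSyn` and `mod32_gt` are hypothetical, rate-throttling of the ACKs is left to the caller, `last_seq` holds the sequence number corresponding to the peer's FIN on the previous incarnation, and both sequence numbers and timestamps are compared modulo 2^32 (an assumption of this sketch, following the usual serial-number arithmetic).

```python
from dataclasses import dataclass
from typing import Optional

SEQ_MOD = 1 << 32

SYNCHRONIZED_STATES = {
    "ESTABLISHED", "FIN-WAIT-1", "FIN-WAIT-2", "CLOSE-WAIT",
    "CLOSING", "LAST-ACK", "TIME-WAIT",
}


def mod32_gt(a: int, b: int) -> bool:
    """True if 32-bit value a is 'after' b, using serial-number arithmetic."""
    diff = (a - b) % SEQ_MOD
    return 0 < diff < (1 << 31)


@dataclass
class PrevIncarnation:
    used_timestamps: bool  # did the old incarnation use Timestamps?
    last_tsval: int        # last TSval seen from the peer (if used_timestamps)
    last_seq: int          # sequence number of the peer's FIN on the old incarnation


@dataclass
class IncomingSyn:
    seq: int
    tsval: Optional[int]   # None when the SYN carries no Timestamps option


def process_syn(state: str, prev: PrevIncarnation, syn: IncomingSyn,
                local_ts_enabled: bool) -> str:
    """Return "ack", "accept" (create SYN-RECEIVED) or "drop"."""
    if state not in SYNCHRONIZED_STATES:
        raise ValueError("this sketch only covers the synchronized states")
    if state != "TIME-WAIT":
        return "ack"  # the caller is expected to rate-throttle these ACKs

    # Timestamps are "enabled" only if both the SYN and our SYN/ACK carry them.
    ts_enabled = syn.tsval is not None and local_ts_enabled
    seq_newer = mod32_gt(syn.seq, prev.last_seq)

    if prev.used_timestamps:
        if ts_enabled and mod32_gt(syn.tsval, prev.last_tsval):
            return "accept"
        if ts_enabled and syn.tsval == prev.last_tsval and seq_newer:
            return "accept"
        if not ts_enabled and seq_newer:
            return "accept"
        return "drop"  # leave the old incarnation in TIME-WAIT

    # The previous incarnation did not use timestamps.
    if ts_enabled or seq_newer:
        return "accept"
    return "drop"
```

Note that a newer timestamp is enough to accept the request even when the new ISN is "behind" the old FIN, which is precisely what relaxes the constraints on ISN generation discussed below.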
The "last sequence number seen on the previous incarnation of the connection (for the same direction of the data transfer)" refers to the last sequence number used by the previous incarnation of the connection (for the same direction of the data transfer), and not to the last value seen in the Sequence Number field of the corresponding segments. That is, it refers to the sequence number corresponding to the FIN flag of the previous incarnation of the connection, for that direction of the data transfer.

The processing rules proposed in this section do not comply with one of the requirements in the upcoming RFC "Improving TCP's Robustness to Blind In-Window Attacks" [Ramaiah et al, 2008], which requires implementations to send an ACK in response to in-window SYN segments received for connections in any of the synchronized states (including the TIME-WAIT state).

Many implementations do not include the TCP Timestamps option when performing the above heuristics, thus imposing stricter constraints on the generation of Initial Sequence Numbers, the average data transfer rate of the connections, and the amount of data transferred with them. RFC 793 [Postel, 1981c] states that the ISN generator should be incremented roughly once every four microseconds (i.e., roughly 250000 times per second). As a result, any connection that transfers more than 250000 bytes of data at more than 250 KB/s could lead to scenarios in which the last sequence number seen on a connection that moves into the TIME-WAIT state is still greater than the sequence number of an incoming SYN segment that aims at creating a new incarnation of the same connection. In those scenarios, the 4.4BSD heuristics would fail, and therefore the connection request would usually time out.
By including the TCP Timestamps option in the heuristics described above, all these constraints are greatly relaxed.

It is clear that the use of TCP timestamps for the heuristics described above depends on the timestamps being monotonically increasing across connections between the same two TCP endpoints. Therefore, we strongly advise generating timestamps as described in Section 4.7.1.

11.1.3. Security/Compartment

Section 3.9 (page 71) of RFC 793 [Postel, 1981c] states that if the IP security/compartment of an incoming segment does not exactly match the security/compartment in the TCB, an RST segment should be sent, and the connection should be aborted.

A discussion of the IP security options relevant to this section can be found in Section 3.13.2.12, Section 3.13.2.13, and Section 3.13.2.14 of [CPNI, 2008].

This certainly provides another attack vector for performing connection-reset attacks, as an attacker could forge TCP segments with a security/compartment that is different from that recorded in the corresponding TCB and, as a result, the attacked connection would be reset.

It is interesting to note that for connections in the ESTABLISHED state, this check is performed after validating the TCP Sequence Number and checking the RST bit, but before validating the Acknowledgement field. Therefore, even if the stricter validation of the Acknowledgement field (described in Section 3.4) were implemented, it would not help to mitigate this attack vector.

This attack vector can be easily mitigated by relaxing the reaction to TCP segments with "incorrect" security/compartment values: if the security/compartment field does not match the value recorded in the corresponding TCB, TCP should not abort the connection, but simply discard the corresponding packet.
Additionally, this whole event should be logged as a security violation.

11.1.4. Precedence

Section 3.9 (page 71) of RFC 793 [Postel, 1981c] states that if the IP Precedence of an incoming segment does not exactly match the Precedence recorded in the TCB, an RST segment should be sent, and the connection should be aborted.

This certainly provides another attack vector for performing connection-reset attacks, as an attacker could forge TCP segments with an IP Precedence that is different from that recorded in the corresponding TCB and, as a result, the attacked connection would be reset.

It is interesting to note that for connections in the ESTABLISHED state, this check is performed after validating the TCP Sequence Number and checking the RST bit, but before validating the Acknowledgement field. Therefore, even if the stricter validation of the Acknowledgement field (described in Section 3.4) were implemented, it would not help to mitigate this attack vector.

This attack vector can be easily mitigated by relaxing the reaction to TCP segments with "incorrect" IP Precedence values. That is, even if the Precedence field does not match the value recorded in the corresponding TCB, TCP should not abort the connection, and should instead continue processing the segment as specified by RFC 793.

It is interesting to note that resetting a connection due to a change in the Precedence value might have a negative impact on interoperability. For example, the packets that correspond to the connection could temporarily take a different Internet path, on which some middle-box could re-mark the Precedence field (due to administrative policies at the network being transited). In such a scenario, an implementation following the advice in RFC 793 would abort the connection, when the connection would probably have survived.
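The difference between the RFC 793 reaction and the relaxed reactions recommended in Sections 11.1.3 and 11.1.4 can be sketched as follows. The function name `check_ip_attributes` and its string results are hypothetical, the sketch only models the decision for a segment on a synchronized connection, and Python's standard `logging` module stands in for whatever security-event log a real implementation provides.

```python
import logging

logger = logging.getLogger("tcp-sketch")


def check_ip_attributes(tcb_precedence: int, tcb_compartment: str,
                        seg_precedence: int, seg_compartment: str,
                        rfc793_strict: bool = False) -> str:
    """Return "reset", "drop" or "continue".

    rfc793_strict=True models RFC 793 (any mismatch aborts the connection);
    False models the relaxed reactions recommended in this document.
    """
    if seg_compartment != tcb_compartment:
        if rfc793_strict:
            return "reset"
        # Recommended: discard the packet, keep the connection, log the event.
        logger.warning("security/compartment mismatch: TCB=%r segment=%r",
                       tcb_compartment, seg_compartment)
        return "drop"
    if seg_precedence != tcb_precedence and rfc793_strict:
        return "reset"
    # Recommended: a Precedence mismatch is ignored and processing continues.
    return "continue"
```

Under the relaxed rules, a forged segment can at worst be discarded; it can no longer tear down the connection.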
While the IPv4 Type of Service field (and hence the Precedence field) has been redefined by the Differentiated Services (DS) field specified in RFC 2474 [Nichols et al, 1998], RFC 793 [Postel, 1981c] was never formally updated in this respect. We note that both legacy systems that have not been upgraded to implement the differentiated services architecture described in RFC 2475 [Blake et al, 1998] and current implementations that have extrapolated the discussion of the Precedence field to the Differentiated Services field may still be vulnerable to the connection-reset vector discussed in this section.

11.1.5. Illegal options

Section 4.2.2.5 of RFC 1122 [Braden, 1989] discusses the processing of TCP options. It states that TCP must be able to receive a TCP option in any segment, and must ignore without error any option it does not implement. Additionally, it states that TCP should be prepared to handle an illegal option length (e.g., zero) without crashing, and suggests handling such illegal options by resetting the corresponding connection and logging the reason. However, this suggested behavior could be exploited to perform connection-reset attacks. Therefore, as discussed in Section 3.10 of this document, we advise TCP implementations to silently drop those TCP segments that contain illegal option lengths.

11.2. Blind data-injection attacks

An attacker could try to inject data into the stream of data being transferred on the connection. As with the other attacks described in Section 11 of this document, in order to perform a blind data-injection attack the attacker would need to know or guess the four-tuple that identifies the TCP connection to be attacked. Additionally, he would need to guess a valid ("in window") TCP Sequence Number, and a valid Acknowledgement Number.
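To give a rough sense of the effort involved, the following sketch computes a worst-case number of forged segments for an attacker who already knows the four-tuple, sweeps the sequence number space in strides of RCV.WND and, for each sequence guess, sweeps the Acknowledgement Number space in strides of the range of acceptable ACK values. The model and the name `blind_injection_probes` are assumptions of this sketch: by default it treats roughly half of the 32-bit space as acceptable (an RFC 793-style check that only rejects ACKs "ahead" of SND.NXT), while a stricter check, such as the one proposed in [Ramaiah et al, 2008], shrinks that range considerably.

```python
SEQ_SPACE = 1 << 32  # size of the 32-bit sequence/acknowledgement number space


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division (avoids floating point)."""
    return -(-a // b)


def blind_injection_probes(rcv_wnd: int, acceptable_acks: int = SEQ_SPACE // 2) -> int:
    """Worst-case forged segments for a blind data-injection attack (sketch model)."""
    seq_guesses = ceil_div(SEQ_SPACE, rcv_wnd)
    ack_guesses = ceil_div(SEQ_SPACE, acceptable_acks)
    return seq_guesses * ack_guesses


# Roughly half the ACK space acceptable (this sketch's RFC 793-style model).
print(blind_injection_probes(65536))                         # 131072
# Only ~64K acceptable ACK values (a stricter ACK check).
print(blind_injection_probes(65536, acceptable_acks=65536))  # 4294967296
```

With a 64 KB window, narrowing the acceptable ACK range from half the space to 65536 values raises the worst case from 131072 segments to 2^32 segments in this model.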
As discussed in Section 3.4 of this document, [Ramaiah et al, 2008] proposes enforcing a stricter check on the Acknowledgement Number of incoming segments than that specified in RFC 793 [Postel, 1981c].

Implementation of the proposed check forces the attacker to send more packets in order to successfully perform a blind data-injection attack. However, it should be noted that applications concerned with any of the attacks discussed in Section 11 of this document should make use of proper authentication techniques, such as those specified for IPsec in RFC 4301 [Kent and Seo, 2005].

12. Information leaking

12.1. Remote Operating System detection via TCP/IP stack fingerprinting

Clearly, remote Operating System (OS) detection is a useful tool for attackers. Tools such as nmap [Fyodor, 2006b] can usually detect the operating system type and version of a remote system with remarkable precision. This information can in turn be used by attackers to tailor their exploits to the identified operating system type and version.

Evasion of OS fingerprinting can prove to be a very difficult task.

Most systems make use of a variety of protocols, each of which has a large number of parameters that can be set to arbitrary values. Thus, information on the operating system may be obtained from a number of sources, ranging from application banners to more obscure parameters such as TCP's retransmission timer.

Nmap [Fyodor, 2006b] is probably the most popular tool for remote OS detection via active TCP/IP stack fingerprinting. p0f [Zalewski, 2006a], on the other hand, is a tool for performing remote OS detection via passive TCP/IP stack fingerprinting. SinFP [SinFP, 2006] can perform both active and passive fingerprinting.
Finally, TBIT [TBIT, 2001] is a TCP fingerprinting tool that aims at characterizing the behavior of a remote TCP peer based on active probes, and which has been widely used in the research community.

TBIT [TBIT, 2001] implements a number of tests not present in other tools, such as characterizing the behavior of a TCP peer with respect to TCP congestion control.

[Fyodor, 1998] and [Fyodor, 2006a] are classic papers on the subject. [Miller, 2006] and [Smith and Grundl, 2002] provide an introduction to passive TCP/IP stack fingerprinting. [Smart et al, 2000] and [Beck, 2001] discuss some techniques for evading OS detection through TCP/IP stack fingerprinting.

The following subsections discuss TCP-based techniques for remote OS detection via TCP/IP stack fingerprinting and, where possible, propose ways to mitigate them.

12.1.1. FIN probe

The attacker sends a FIN (or any packet without the SYN or the ACK flags set) to an open port. RFC 793 [Postel, 1981c] leaves the reaction to such segments unspecified. As a result, some implementations silently drop the received segment, while others respond with an RST. We advise implementations to silently drop any segments received for a connection in the LISTEN state that do not have the SYN, RST, or ACK flags set. In the rest of the cases, the processing rules in RFC 793 should be applied.

12.1.2. Bogus flag test

The attacker sends a TCP segment setting at least one bit of the Reserved field. Some implementations ignore this field, while others reset the corresponding connection or reflect the field in the TCP segment sent in response. We advise implementations to ignore any flags not supported, and not to reflect them if a TCP segment is sent in response to the one just received.

12.1.3. TCP ISN sampling

The attacker samples a number of Initial Sequence Numbers by sending a number of connection requests. TCP implementations differ in the ISN generator they implement, thus allowing the ISN generation algorithm to be correlated with the operating system type and version.

This document advises implementing an ISN generator that follows the behavior described in RFC 1948 [Bellovin, 1996]. However, it should be noted that even if all TCP implementations generated their ISNs as proposed in RFC 1948, a number of implementation details would still be left unspecified, which would allow remote OS fingerprinting by means of ISN sampling. For example, the time-dependent parameter of the hash could have a different frequency in different TCP implementations.

12.1.4. TCP initial window

Many TCP implementations differ in the initial TCP window they use. There are a number of factors that should be considered when selecting the TCP window to be used for a given system. A number of implementations that use static windows (i.e., no automatic buffer tuning mechanisms are implemented) default to a window of around 32 KB, which seems sensible for the general case. On the other hand, a window of 4 KB seems to be common practice for connections servicing critical applications such as BGP. It is clear that the window size is a tradeoff among a number of considerations. Section 3.7 discusses some of the considerations that should be made when selecting the window size for a TCP connection.

If automatic tuning mechanisms are implemented, we suggest the initial window to be at least 4 * RMSS (i.e., four maximum-sized segments). We note that a remote OS fingerprinting tool could still sample the advertised TCP window, trying to correlate the advertised window with the potential automatic buffer tuning algorithm and Operating System.

12.1.5. RST sampling

[Fyodor, 1998] reports that many implementations differ in the Acknowledgement Number they use in response to segments received for connections in the CLOSED state. In particular, these implementations differ in the way they construct the RST segment that is sent in response to those TCP segments received for connections in the CLOSED state. Here we provide advice on how the corresponding RST segments should be constructed.

If the ACK bit of an incoming TCP segment is off, a Sequence Number of zero should be used in the RST segment sent in response. That is,

   <SEQ=0><ACK=SEG.SEQ+SEG.LEN><CTL=RST,ACK>

It should be noted that the SEG.LEN value used for the Acknowledgement Number should be incremented once for each flag set in the original segment that makes use of a byte of the sequence number space. That is, if only one of the SYN or FIN flags were set in the received segment, the Acknowledgement Number of the response should be set to SEG.SEQ+SEG.LEN+1. If both the SYN and FIN flags were set in the received segment, the Acknowledgement Number should be set to SEG.SEQ+SEG.LEN+2.

RFC 793 [Postel, 1981c] describes (in pages 36-37) how RST segments are to be generated. According to this RFC, the ACK bit (and the Acknowledgement Number) is set in an RST only if the incoming segment that elicited the RST did not have the ACK bit set (and thus the Sequence Number of the outgoing RST segment must be set to zero). However, we recommend that TCP implementations set the ACK bit (and the Acknowledgement Number) in all outgoing RST segments, as doing so allows additional validation checks to be enforced at the system receiving the segment.

12.1.6. TCP options

Different implementations differ in the TCP options they enable by default.
Additionally, they differ in the actual contents of the options, and in the order in which the options are included in a TCP segment. There is currently no recommendation on the order in which to include TCP options in TCP segments.

12.1.7. Retransmission Timeout (RTO) sampling

TCP uses a retransmission timer for retransmitting data in the absence of any feedback from the remote data receiver. The duration of this timer is referred to as the "retransmission timeout" (RTO). RFC 2988 [Paxson and Allman, 2000] specifies the algorithm for computing the TCP retransmission timeout (RTO).

The algorithm allows the use of clocks of different granularities, to accommodate the different granularities used by existing implementations. Thus, the difference in the resulting RTO can be used for remote OS fingerprinting. [Veysset et al, 2002] describes how to perform remote OS fingerprinting by sampling and analyzing the RTO of the target system. However, this fingerprinting technique has at least the following drawbacks:

   o  It is usually much slower than other fingerprinting techniques, as it may require considerable time to sample the RTO of a given target.

   o  It is less reliable than other fingerprinting techniques, as latency and packet loss can lead to bogus results.

While in principle it would be possible to defeat this fingerprinting technique (e.g., by obfuscating the granularity of the clock used for computing the RTO), we consider that a more important step to defeat remote OS detection is for implementations to address the more effective fingerprinting techniques described in Sections 12.1.1 through 12.1.6 of this document.

12.2. System uptime detection

The "uptime" of a system may prove to be valuable information to an attacker. For example, it might reveal the last time a security patch was applied.
Information about system uptime is usually leaked by TCP header fields or options that are (or may be) time-dependent, and are usually initialized to zero when the system is bootstrapped. As a result, if the attacker knows the frequency with which the corresponding parameter or header field is incremented, and is able to sample the current value of that parameter or header field, the system uptime can be easily obtained. Two fields that can potentially reveal the system uptime are the Sequence Number field of a SYN or SYN/ACK segment (i.e., when it contains an ISN) and the TSval field of the Timestamps option. Section 3.3.1 of this document discusses the generation of TCP Initial Sequence Numbers. Section 4.7.1 of this document discusses the generation of TCP timestamps.

13. Covert channels

As with virtually every communications protocol, TCP can be exploited to establish covert channels. While an exhaustive discussion of covert channels is out of the scope of this document, for completeness we note that it is possible for a (probably malicious) user to establish a covert channel by means of TCP, such that data can be surreptitiously passed to a remote system, probably unnoticed by a monitoring system, and with the possibility of concealing the location of the source system.

In most cases, covert channels based on manipulation of TCP fields can be eliminated by protocol scrubbers and other middle-boxes. On the other hand, "timing channels" may prove to be more difficult to eliminate.

[Rowland, 1996] contains a discussion of covert channels in the TCP/IP protocol suite, with some TCP-based examples. [Giffin et al, 2002] describes the use of TCP timestamps for the establishment of covert channels.
[Zander, 2008] contains an extensive bibliography of papers on covert channels, and a list of freely available tools that implement covert channels with the TCP/IP protocol suite.

14. TCP Port scanning

TCP port scanning aims at identifying TCP port numbers on which there is a process listening for incoming connections. That is, it aims at identifying TCPs at the target system that are in the LISTEN state. The following subsections describe different TCP port-scanning techniques that have been implemented in freely available tools. These subsections focus only on those port-scanning techniques that exploit features of TCP itself, and not of other communication protocols.

For example, the following subsections do not discuss the exploitation of application protocols (such as FTP) or the exploitation of features of underlying protocols (such as the IP Identification field) for port-scanning purposes.

14.1. Traditional connect() scan

The most trivial scanning technique consists of trying to perform the TCP three-way handshake with each of the port numbers at the target system (e.g., by issuing a call to the connect() function of the Sockets API). The three-way handshake will complete for port numbers that are "open", but will fail for those port numbers that are "closed".

As this port-scanning technique can be implemented by issuing a call to the connect() function of the Sockets API that normal applications use, it does not require the attacker to have superuser privileges. The downside of this port-scanning technique is that it is less efficient than other scanning methods (e.g., the "SYN scan" described in Section 14.2), and that it can be easily logged by the target system.

14.2. SYN scan

The SYN scan was introduced as a "stealth" port-scanning technique.
It aims to prevent the target system from logging the port scan by not completing the TCP three-way handshake. When a SYN/ACK segment is received in response to the initial SYN segment, the system performing the port scan responds with an RST segment, thus preventing the three-way handshake from completing. While this port-scanning technique is harder to detect and log than the traditional connect() scan described in Section 14.1, most current NIDS (Network Intrusion Detection Systems) can detect and log it.

SYN scans are, however, sometimes mistakenly reported as "SYN flood" attacks by NIDS.

The main advantage of this port-scanning technique is that it is much more efficient than the traditional connect() scan.

In order to implement this port-scanning technique, port-scanning tools usually bypass the TCP API and forge the SYN segments they send (e.g., by using raw sockets). This typically requires the attacker to have superuser privileges to run the port-scanning tool.

14.3. FIN, NULL, and XMAS scans

RFC 793 [Postel, 1981c] states, on page 65, that an incoming segment that does not have the RST bit set and that is received for a connection in the fictional state CLOSED causes an RST to be sent in response. Pages 65-66 of RFC 793 describe the processing of incoming segments for connections in the LISTEN state, and implicitly state that an incoming segment that does not have the ACK bit set (and is not a SYN or an RST) should be silently dropped.

As a result, an attacker can exploit this situation to perform a port scan by sending TCP segments that do not have the ACK bit set to the target system. When a port is "closed" (i.e., there is a TCP in the fictional state CLOSED on the corresponding port), the target system will respond with an RST segment. On the other hand, if the port is "open" (i.e., there is a TCP in the LISTEN state), the attacker will not get any response from the target system.

Since the only requirement for exploiting this port-scanning vector is that the probe segments must not have the ACK bit set (and must be neither SYN nor RST segments), there are a number of different TCP control-bit combinations that can be used for the probe segments.

When the probe segment sent to the target system is a TCP segment that has only the FIN bit set, the scanning technique is usually referred to as a "FIN scan". When the probe segment is a TCP segment that does not have any of the control bits set, the scanning technique is usually known as a "NULL scan". Finally, when the probe segment sent to the target system has only the FIN, PSH, and URG bits set, the port-scanning technique is known as an "XMAS scan".

While the aforementioned control-bit combinations are the most popular ones, other combinations could be used to exploit this port-scanning vector. For example, the CWR, ECE, and/or any of the Reserved bits could be set in the probe segments.

The advantage of this port-scanning technique is that it can bypass some stateless firewalls. However, the downside is that a number of implementations do not comply strictly with RFC 793 [Postel, 1981c], and thus always respond to the probe segments with an RST, regardless of whether the port is open or closed.

This port-scanning vector can be easily defeated by responding with an RST when a TCP segment is received for a connection in the LISTEN state and the incoming segment has neither the SYN bit nor the RST bit set. We recommend that TCP/IP stacks implement this alternative processing of TCP segments for connections in the LISTEN state.

14.4. Maimon scan

This port-scanning technique was introduced in [Maimon, 1996] with the name "StealthScan" (method #1), and was later incorporated into the nmap tool [Fyodor, 2006b] as the "Maimon scan".

This port-scanning technique employs TCP segments that have both the FIN and ACK bits set as the probe segments. While according to RFC 793 [Postel, 1981c] these segments should elicit an RST regardless of whether the corresponding port is open or closed, a programming flaw found in a number of TCP implementations has caused some systems to silently drop the probe segment if the corresponding port was open (i.e., there was a TCP in the LISTEN state), and respond with an RST only if the port was closed.

Therefore, an RST would indicate that the scanned port is closed, while the absence of a response from the target system would indicate that the scanned port is open.

While this bug has not been found in current implementations of TCP, it might still be present in some legacy systems.

14.5. Window scan

This port-scanning technique employs ACK segments as the probe packets. ACK segments will elicit an RST from the target system regardless of whether the corresponding TCP port is open or closed. However, as described in [Maimon, 1996], some systems set the Window field of the RST segments to different values depending on whether the corresponding TCP port is open or closed. These systems set the Window field of their RST segments to zero when the corresponding TCP port is closed, and set the Window field to a non-zero value when the corresponding TCP port is open.

As a result, an attacker could exploit this situation to perform a port scan by sending ACK segments to the target system and examining the Window field of the RST segments that his probe segments elicit.
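The LISTEN and CLOSED processing that the FIN, NULL, and XMAS scans (Section 14.3) and the Maimon scan (Section 14.4) rely on can be modelled in a few lines. The following Python sketch is a simplified, hypothetical model rather than code from any TCP implementation: `reaction()`, `Reaction`, `SCANS`, and the `rst_in_listen` switch are illustrative names, only the RFC 793 checks relevant here are modelled (an incoming RST, an ACK, a SYN, anything else), and neither the buggy Maimon-scan behaviour nor the Window-field leak of Section 14.5 is modelled. It shows that strict RFC 793 processing lets the FIN, NULL, and XMAS probes tell open from closed ports, and that the alternative LISTEN processing recommended in Section 14.3 removes the difference.

```python
from enum import Enum

# TCP control bits, as laid out in the TCP header flags field.
FIN, SYN, RST, PSH, ACK, URG = 0x01, 0x02, 0x04, 0x08, 0x10, 0x20


class Reaction(Enum):
    NONE = "no response"
    RST = "RST"
    SYN_ACK = "SYN/ACK"


def reaction(flags: int, listening: bool, rst_in_listen: bool = False) -> Reaction:
    """Simplified reaction to a segment that matches no existing connection.

    listening:     True for a port in LISTEN, False for the fictional CLOSED state.
    rst_in_listen: hypothetical switch selecting the alternative LISTEN
                   processing recommended in Section 14.3.
    """
    if flags & RST:
        return Reaction.NONE        # an incoming RST is never answered
    if not listening:
        return Reaction.RST         # CLOSED: any other segment elicits an RST
    if flags & ACK:
        return Reaction.RST         # LISTEN: any acknowledgment is unacceptable
    if flags & SYN:
        return Reaction.SYN_ACK     # LISTEN: connection request
    # LISTEN, no SYN, ACK, or RST: RFC 793 drops the segment silently.
    return Reaction.RST if rst_in_listen else Reaction.NONE


# Probe control bits used by the scans described in Sections 14.3 and 14.4.
SCANS = {"FIN": FIN, "NULL": 0, "XMAS": FIN | PSH | URG, "Maimon": FIN | ACK}

if __name__ == "__main__":
    for policy in (False, True):
        print("RST in LISTEN:" if policy else "strict RFC 793:")
        for name, flags in SCANS.items():
            opened = reaction(flags, listening=True, rst_in_listen=policy)
            closed = reaction(flags, listening=False, rst_in_listen=policy)
            print(f"  {name:6} open={opened.value:11} closed={closed.value:11} "
                  f"distinguishable={opened != closed}")
```

Under strict processing the FIN, NULL, and XMAS probes get no response from open ports and an RST from closed ones, while the Maimon probe elicits an RST in both cases; with `rst_in_listen=True` every probe in `SCANS` is answered identically.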
In order to defeat this port-scanning technique, we recommend that TCP implementations set the Window field to zero in all the RST segments they send.

Most popular implementations of TCP already implement this policy.

14.6. ACK scan

The so-called "ACK scan" is not really a port-scanning technique (i.e., it does not aim at determining whether a specific port is open or closed), but rather aims at determining whether some intermediate system is filtering TCP segments sent to that specific port number.

The probe packet is a TCP segment with the ACK bit set which, according to RFC 793 [Postel, 1981c], should elicit an RST from the target system regardless of whether the corresponding TCP port is open or closed. If no response is received from the target system, it is assumed that some intermediate system is filtering the probe packets sent to the target system.

It should be noted that this "port-scanning" technique exploits basic TCP processing rules, and therefore cannot be defeated at an end-system.

15. Processing of ICMP error messages by TCP

The Internet Control Message Protocol (ICMP) is used in the Internet architecture mainly to perform a fault-isolation function, that is, the group of actions that hosts and routers take to determine that there is some network failure [Clark, 1982].

When a router detects a network problem while trying to forward an IP packet, it usually sends an ICMP error message to the source host to raise awareness of the network problem taking place. In the same way, there are a number of scenarios in which a host may generate an ICMP error message if it finds a problem while processing an IP datagram. The received ICMP errors are handed to the corresponding transport-protocol instance, which will usually perform a fault recovery function.
Unfortunately, ICMP can be exploited to perform a variety of attacks against TCP (and other similar protocols), which include blind connection-reset, blind throughput-reduction, and blind performance-degrading attacks. All of these attacks can be performed even with the attacker being off-path, without the need to sniff the packets that correspond to the attacked TCP connection.

While the security implications of ICMP have been known in the research community for a long time, there is not yet an official proposal on how to deal with these vulnerabilities. However, as a result of the disclosure process carried out by the UK's National Infrastructure Security Co-ordination Centre (NISCC) (during 2004 and 2005) and the publication of an IETF Internet-Draft [Gont, 2008a], virtually all current TCP implementations now incorporate some countermeasures for these attacks.

The next sections describe the use of ICMP to perform attacks against TCP, and the set of countermeasures that have become the "de facto" standard to mitigate the impact of these vulnerabilities.

15.1. Internet Control Message Protocol

The specification of the ICMP protocol is spread among a number of documents. This section provides a roadmap to the ICMP documents that are relevant to TCP.

15.1.1. Internet Control Message Protocol for IP version 4 (ICMP)

RFC 792 [Postel, 1981b] is the base specification of the Internet Control Message Protocol (ICMP) to be used with the Internet Protocol version 4 (IPv4). It defines, among other things, a number of error messages that can be used by end-systems and intermediate-systems to report network errors to the sending host. Additionally, it defines the ICMP Source Quench message (type 4, code 0), which is meant to provide a mechanism for flow control and congestion control.
RFC 1122 [Braden, 1989] classifies ICMP error messages into those that indicate "soft errors" and those that indicate "hard errors", thus roughly defining their semantics.

RFC 1191 [Mogul and Deering, 1990] defines the Path-MTU Discovery (PMTUD) mechanism, which makes use of ICMP error messages of type 3 (Destination Unreachable), code 4 (fragmentation needed and DF bit set) to allow hosts to determine the MTU of an arbitrary internet path.

Finally, Appendix D of RFC 4301 [Kent and Seo, 2005] provides information about which ICMP error messages are produced by hosts, routers, or both.

15.1.2. Internet Control Message Protocol for IP version 6 (ICMPv6)

RFC 4443 [Conta et al, 2006] specifies the Internet Control Message Protocol (ICMPv6) to be used with the Internet Protocol version 6 (IPv6) [Deering and Hinden, 1998].

RFC 4443 [Conta et al, 2006] defines the "Packet Too Big" (type 2, code 0) error message, which is analogous to the ICMP "fragmentation needed and DF bit set" (type 3, code 4) error message. RFC 1981 [McCann et al, 1996] defines the Path MTU Discovery mechanism for IP version 6, which makes use of these messages to determine the MTU of an arbitrary internet path.

Appendix D of RFC 4301 [Kent and Seo, 2005] provides information about which ICMPv6 error messages are generated by hosts, routers, or both.

15.2. Handling of ICMP error messages

RFC 1122 [Braden, 1989] states that a TCP must act on an ICMP error message passed up from the IP layer, directing it to the connection that elicited the error.

In order to allow ICMP messages to be demultiplexed by the receiving host, part of the original packet that elicited the message is included in the payload of the ICMP error message.
Thus, the receiving host can use that information to match the ICMP error to the transport-protocol instance that elicited it.

Neither RFC 793 [Postel, 1981c] nor RFC 1122 [Braden, 1989] recommends any validation checks on the received ICMP messages. Thus, as long as the ICMP payload contains the information that identifies an existing communication instance, it will be handed to the corresponding transport-protocol instance, and the corresponding action will be performed.

Therefore, in the case of TCP, an attacker could send a forged ICMP message to the attacked host and, as long as he is able to guess the four-tuple that identifies the communication instance to be attacked, he will be able to use ICMP to perform a variety of attacks.

As discussed in [Watson, 2004], there are a number of scenarios in which an attacker may know or be able to guess the four-tuple that identifies a TCP connection. If we assume the attacker knows the two systems involved in the TCP connection to be attacked, both the client-side and the server-side IP addresses will be known. Furthermore, as most Internet services use the so-called "well-known" ports, only the client port number would need to be guessed. This means that an attacker would need to send, in principle, at most 65536 packets to perform any ICMP-based attack against TCP. However, as many systems choose the port numbers they use for outgoing connections from a subset of the whole port-number space and do not randomize the ephemeral port numbers, in practice fewer packets are needed to perform any of these attacks.

15.3. Constraints in the possible solutions

For ICMPv4, RFC 792 [Postel, 1981b] states that the internet header plus the first 64 bits of the data of the datagram that elicited the ICMP message are to be included in the payload of the ICMP error message.
Thus, it is assumed that all data needed to identify a transport-protocol instance and process the ICMP error message is contained in the first 64 bits of the transport-protocol header. RFC 1122 [Braden, 1989] allows implementations to optionally include more data from the original packet than that required by the original ICMP specification. Finally, RFC 1812 [Baker, 1995] recommends that ICMP error messages should contain as much of the original datagram as possible without the length of the ICMP datagram exceeding 576 bytes.

Thus, for ICMP messages generated by hosts, we can only expect to get the entire IPv4 header of the original packet, plus the first 64 bits of its payload. For TCP, this means that the only fields that will be included in the ICMP payload are the Source Port, the Destination Port, and the 32-bit TCP Sequence Number. This clearly constrains the validation checks that can be performed, as there is little information available on which to base them.

These constraints mean, for example, that even if TCP were signing its segments by means of the TCP MD5 signature option specified in RFC 2385 [Heffernan, 1998], this mechanism could not be used as a counter-measure against ICMP-based attacks because, as ICMP messages include only a piece of the TCP segment that elicited the error, the MD5 signature could not be recalculated. In the same way, even if the attacked peer was authenticating its packets at the IP layer [Kent and Seo, 2005], because only a part of the original IP packet would be available, the signature used for authentication could not be recalculated, and thus this mechanism could not be used as a counter-measure against ICMP-based attacks against TCP.
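The following Python sketch illustrates how little of the original segment a host can rely on: from the data quoted in an ICMPv4 error message (the original IPv4 header plus 64 bits of its payload), only the addresses, the two port numbers, and the Sequence Number can be recovered. `parse_quoted_tcp()` and `QuotedTCP` are hypothetical names; the sketch assumes the caller has already stripped the 8-byte ICMP header, and it performs only minimal sanity checks. The example addresses are from the documentation ranges.

```python
import ipaddress
import struct
from typing import NamedTuple


class QuotedTCP(NamedTuple):
    src: ipaddress.IPv4Address
    dst: ipaddress.IPv4Address
    sport: int
    dport: int
    seq: int


def parse_quoted_tcp(quoted: bytes) -> QuotedTCP:
    """Extract the demultiplexing fields from the data quoted in an ICMPv4
    error message: the original IPv4 header plus 64 bits of TCP header."""
    if len(quoted) < 20:
        raise ValueError("truncated IPv4 header")
    if quoted[0] >> 4 != 4:
        raise ValueError("not an IPv4 header")
    ihl = (quoted[0] & 0x0F) * 4            # IPv4 header length in bytes
    if ihl < 20 or quoted[9] != 6:          # protocol number 6 is TCP
        raise ValueError("bad header length or not TCP")
    tcp = quoted[ihl:ihl + 8]
    if len(tcp) < 8:
        raise ValueError("fewer than 64 bits of TCP header quoted")
    # Source Port, Destination Port, Sequence Number, in network byte order.
    sport, dport, seq = struct.unpack("!HHI", tcp)
    return QuotedTCP(ipaddress.IPv4Address(quoted[12:16]),
                     ipaddress.IPv4Address(quoted[16:20]), sport, dport, seq)


if __name__ == "__main__":
    # A 20-byte IPv4 header (DF set, protocol TCP) followed by 8 TCP bytes.
    header = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 40, 0, 0x4000, 64, 6, 0,
                         ipaddress.IPv4Address("192.0.2.1").packed,
                         ipaddress.IPv4Address("198.51.100.2").packed)
    print(parse_quoted_tcp(header + struct.pack("!HHI", 49152, 80, 0x12345678)))
```

The Acknowledgment Number, the control bits, and the Window field all lie beyond the first 64 bits of the TCP header, so a host cannot count on having them when validating the message.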
For the IPv6 case, RFC 4443 [Conta et al, 2006] specifies that the payload of ICMPv6 error messages includes as many octets from the IPv6 packet that elicited the ICMPv6 error message as will fit without making the resulting ICMPv6 packet exceed the minimum IPv6 MTU (1280 octets). Thus, more information is available than in the IPv4 case.

Hosts could require ICMP error messages to be authenticated (e.g., by means of IPsec) in order to act upon them. However, while this requirement could make sense for those ICMP error messages sent by hosts, it would not be feasible for those ICMP error messages generated by routers, as this would imply either that the attacked host should have a security association with every existing router, or that it should be able to establish one dynamically. The current level of deployment of protocols for dynamic establishment of security associations makes this infeasible. Also, in some cases, such as embedded devices, the processing power requirements of authentication might not allow IPsec authentication to be implemented effectively.

15.4. General countermeasures against ICMP attacks

There are a number of countermeasures that can be implemented to eliminate or mitigate the impact of ICMP-based attacks against TCP. The general countermeasures discussed in the following subsections help to mitigate many ICMP-based attacks against TCP. Rather than being alternative countermeasures, they can be implemented together to increase the protection against these attacks.

15.4.1. TCP sequence number checking

The current specifications do not impose any validity checks on the TCP segment that is contained in the ICMP payload. For instance, no checks are performed to verify that a received ICMP error message has been elicited by a segment that was "in flight" to the destination.
Thus, even stale ICMP error messages will be acted upon.

TCP should check that the TCP Sequence Number contained in the payload of the ICMP error message is within the range of the data already sent but not yet acknowledged. That is,

   SND.UNA =< Sequence Number < SND.NXT

If an ICMP error message does not pass this check, it should be silently dropped.

Even if an attacker were able to guess the four-tuple that identifies the TCP connection, this additional check would reduce the probability of a forged ICMP packet being considered valid to FlightSize/2^32 (where FlightSize is the number of data bytes already sent to the remote peer but not yet acknowledged, as defined in RFC 2581 [Allman et al, 1999]). For connections in the SYN-SENT or SYN-RECEIVED states, this would reduce that probability to 1/2^32. For a TCP endpoint with no data "in flight", this would completely eliminate the possibility of success of these attacks.

This check has been incorporated by most major implementations of TCP.

It is important to note that while this check greatly increases the number of packets required to perform any of the attacks discussed in this document, this may not be enough in those scenarios in which bandwidth is easily available and/or large TCP windows are in use (e.g., by means of the mechanism specified in RFC 1323 [Jacobson et al, 1992]). Therefore, implementation of the attack-specific countermeasures discussed in this document is strongly recommended.

15.4.2. Port randomization

As discussed in the previous sections, in order to perform any of these ICMP-based attacks, an attacker would need to guess (or know) the four-tuple that identifies the connection to be attacked.
Increasing the port-number range used for outgoing TCP connections, and obfuscating the ephemeral port numbers used for outgoing TCP connections, would make it harder for an attacker to perform any of these blind attacks against TCP.

Section 3.1 of this document discusses TCP ephemeral port randomization in detail.

15.4.3. Filtering ICMP error messages based on the ICMP payload

The source address of ICMP error messages does not need to be forged to perform the ICMP-based attacks against TCP. Therefore, simple filtering based on the source address of ICMP error messages does not serve as a counter-measure against these attacks. However, more advanced packet filtering could be implemented in firewalls and other middle-boxes, which could help to mitigate these attacks. Firewalls implementing such advanced filtering would look at the payload of the ICMP error messages, and perform ingress and egress packet filtering based on the source IP address of the IP header contained in the payload of the ICMP error message.

[Gont, 2006] provides a discussion of filtering of ICMP messages based on the ICMP payload.

15.5. Blind connection-reset attack

15.5.1. Description

When TCP is handed an ICMP error message, it will perform its fault recovery function, as follows:

o If the network problem being reported is a hard error, TCP will abort the corresponding connection.

o If the network problem being reported is a soft error, TCP will just record this information, and repeatedly retransmit its data until they either get acknowledged or the connection times out.
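Before either action is taken, the sequence-number check of Section 15.4.1 can filter out forged or stale messages. The following is a minimal Python sketch, assuming 32-bit sequence numbers compared modulo 2^32; `icmp_seq_acceptable()` and `blind_acceptance_probability()` are hypothetical names. Measuring both bounds as offsets from SND.UNA keeps the comparison correct across sequence-number wrap-around, and rejects every message when nothing is in flight.

```python
MOD = 2 ** 32  # TCP sequence numbers are 32-bit values that wrap around


def icmp_seq_acceptable(seq: int, snd_una: int, snd_nxt: int) -> bool:
    """Section 15.4.1 check: SND.UNA =< seq < SND.NXT, modulo 2**32.

    Both sides are measured as offsets from SND.UNA, so the comparison
    stays correct when the sequence space wraps past 2**32 - 1.
    """
    flight_size = (snd_nxt - snd_una) % MOD
    return (seq - snd_una) % MOD < flight_size


def blind_acceptance_probability(snd_una: int, snd_nxt: int) -> float:
    """Chance that one forged message carrying a random Sequence Number
    passes the check: FlightSize / 2**32."""
    return ((snd_nxt - snd_una) % MOD) / MOD


if __name__ == "__main__":
    una, nxt = 0xFFFFFF00, 0x00000100  # 512 bytes in flight, wrapping past 2**32
    for seq in (0xFFFFFF00, 0x00000000, 0x000000FF, 0x00000100, 0x80000000):
        print(hex(seq), icmp_seq_acceptable(seq, una, nxt))
    print("blind acceptance probability:", blind_acceptance_probability(una, nxt))
```

In the example, the first three sequence numbers fall inside the 512 bytes in flight and are accepted, while SND.NXT itself and a distant value are rejected. For a connection in SYN-SENT, SND.NXT is ISS+1, so the probability becomes 1/2^32, as stated in Section 15.4.1.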
RFC 1122 [Braden, 1989] states that a host should abort a connection when receiving an ICMP error message that indicates a "hard error", and states that ICMP error messages of type 3 (Destination Unreachable), codes 2 (protocol unreachable), 3 (port unreachable), and 4 (fragmentation needed and DF bit set) should be considered to indicate hard errors.

While RFC 4443 [Conta et al, 2006] did not exist when RFC 1122 was published, one could extrapolate the concept of "hard errors" to ICMPv6 error messages of type 1 (Destination Unreachable), codes 1 (communication with destination administratively prohibited) and 4 (port unreachable).

Thus, an attacker could use ICMP to perform a blind connection-reset attack. That is, even being off-path, an attacker could reset any TCP connection taking place by sending any ICMP error message that indicates a "hard error" to either of the two TCP endpoints of the connection. Because of TCP's fault recovery policy, the connection would be immediately aborted.

As discussed in Section 15.2, all an attacker needs to know to perform such an attack is the socket pair that identifies the TCP connection to be attacked. In some scenarios, the IP addresses and port numbers in use may be easily guessed or known to the attacker [Watson, 2004].

Some stacks are known to propagate ICMP errors across TCP connections, increasing the impact of this attack, as a single ICMP packet could bring down all the TCP connections between the corresponding peers.
It is important to note that even if TCP itself were protected against the blind connection-reset attack described in [Watson, 2004] and [NISCC, 2004] by means of IPsec authentication [Kent and Seo, 2005], by means of the TCP MD5 signature option specified in RFC 2385 [Heffernan, 1998], or by means of the mechanism proposed in [Ramaiah et al, 2008], the blind connection-reset attack described in this document could still succeed.

15.5.2. Attack-specific countermeasures

Changing the reaction to hard errors

An analysis of the circumstances in which ICMP messages that indicate hard errors may be received can shed some light on how to eliminate the impact of ICMP-based blind connection-reset attacks.

ICMP type 3 (Destination Unreachable), code 2 (protocol unreachable)

This ICMP error message indicates that the host sending the ICMP error message received a packet meant for a transport protocol it does not support. For connection-oriented protocols such as TCP, one could expect to receive such an error as the result of a connection-establishment attempt. However, it would be strange to get such an error during the life of a connection, as this would indicate that support for that transport protocol has been removed from the host sending the error message during the life of the corresponding connection. Thus, it would be fair to treat ICMP protocol unreachable error messages as soft errors if they are meant for connections that are in synchronized states. For TCP, this means TCP should treat ICMP protocol unreachable error messages as soft errors if they are meant for connections that are in the ESTABLISHED, FIN-WAIT-1, FIN-WAIT-2, CLOSE-WAIT, CLOSING, LAST-ACK or TIME-WAIT states.
ICMP type 3 (Destination Unreachable), code 3 (port unreachable)

This error message indicates that the host sending the ICMP error message received a packet meant for a socket {IP address, port number} on which there is no process listening. Those transport protocols which have their own mechanisms for notifying this condition should not be receiving these error messages. However, RFC 1122 [Braden, 1989] states that even those transport protocols that have their own mechanism for notifying the sender that a port is unreachable must nevertheless accept an ICMP Port Unreachable for the same purpose. For security and robustness reasons, it would be fair to treat ICMP port unreachable messages as soft errors when they are meant for protocols that have their own mechanism for reporting this error condition.

ICMP type 3 (Destination Unreachable), code 4 (fragmentation needed and DF bit set)

This error message indicates that an intermediate node needed to fragment a datagram, but the DF (Don't Fragment) bit in the IPv4 header was set. Those systems that do not implement the PMTUD mechanism should not be sending their IP packets with the DF bit set, and thus should not be receiving these ICMP error messages. Thus, it would be fair for them to treat this ICMP error message as indicating a soft error, therefore not aborting the corresponding connection when such an error message is received. On the other hand, those systems implementing the Path-MTU Discovery (PMTUD) mechanism specified in RFC 1191 [Mogul and Deering, 1990] and RFC 1981 [McCann et al, 1996] should not abort the corresponding connection when such an ICMP error message is received, since for them this message is an input to PMTUD rather than an indication of a fatal error.
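The ICMPv4 treatment discussed above, together with the ICMPv6 "hard errors" of Section 15.5.1 (handled by analogy), can be summarized as a small decision function. This Python sketch is hypothetical: the state names follow RFC 793, but `icmp_action()`, its return strings, and the `countermeasure` and `pmtud` switches are illustrative, and the message is assumed to have already passed the check of Section 15.4.1. Setting `countermeasure=False` reproduces the RFC 1122 behaviour that enables the blind connection-reset attack.

```python
from enum import Enum, auto


class State(Enum):
    SYN_SENT = auto()
    SYN_RECEIVED = auto()
    ESTABLISHED = auto()
    FIN_WAIT_1 = auto()
    FIN_WAIT_2 = auto()
    CLOSE_WAIT = auto()
    CLOSING = auto()
    LAST_ACK = auto()
    TIME_WAIT = auto()


SYNCHRONIZED = {State.ESTABLISHED, State.FIN_WAIT_1, State.FIN_WAIT_2,
                State.CLOSE_WAIT, State.CLOSING, State.LAST_ACK,
                State.TIME_WAIT}

# (IP version, ICMP type, code) treated as "hard errors": RFC 1122 for
# ICMPv4, and the extrapolation to ICMPv6 discussed in Section 15.5.1.
HARD_ERRORS = {(4, 3, 2), (4, 3, 3), (4, 3, 4), (6, 1, 1), (6, 1, 4)}
FRAG_NEEDED = (4, 3, 4)

# Treated as soft in every state when the countermeasure is enabled: TCP
# reports closed ports itself with RSTs, and "fragmentation needed" is
# never a reason to abort.
SOFT_IN_ALL_STATES = {(4, 3, 3), (6, 1, 4), FRAG_NEEDED}


def icmp_action(version: int, icmp_type: int, code: int, state: State,
                countermeasure: bool = True, pmtud: bool = True) -> str:
    """Return "abort", "record" (soft error), or "pmtud" for an ICMP error
    that has already passed the sequence-number check."""
    key = (version, icmp_type, code)
    if key == FRAG_NEEDED and pmtud:
        return "pmtud"          # handed to Path-MTU Discovery instead
    if key not in HARD_ERRORS:
        return "record"         # soft error per RFC 1122
    if countermeasure and (state in SYNCHRONIZED or key in SOFT_IN_ALL_STATES):
        return "record"         # hard error downgraded to a soft error
    return "abort"
```

With the countermeasure enabled, protocol unreachable and administratively prohibited errors still abort a connection attempt in SYN-SENT or SYN-RECEIVED, which is where the text expects such errors to arise legitimately.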
ICMPv6 type 1 (Destination Unreachable), code 1 (communication with destination administratively prohibited)

This error message indicates that the destination is unreachable because of an administrative policy. For connection-oriented protocols such as TCP, one could expect to receive such an error as the result of a connection-establishment attempt. Receiving such an error for a connection in any of the synchronized states would mean that the administrative policy changed during the life of the connection. Therefore, while it would be possible for a firewall to be reconfigured during the life of a connection, it would be fair, for security and robustness reasons, to ignore these messages for connections that are in the ESTABLISHED, FIN-WAIT-1, FIN-WAIT-2, CLOSE-WAIT, CLOSING, LAST-ACK or TIME-WAIT states.

ICMPv6 type 1 (Destination Unreachable), code 4 (port unreachable)

This error message is analogous to the ICMP type 3 (Destination Unreachable), code 3 (port unreachable) error message discussed above. Thus, the same considerations apply.

In summary, TCP should treat all ICMP error messages as indicating "soft errors" when they are meant for connections in any of the synchronized states, and therefore should not abort the corresponding connection upon receipt of them. Also, as discussed in Section 15.5.1, hosts should not extrapolate ICMP errors across TCP connections.

If the received message was legitimate, it would mean that the "hard error" condition appeared during the life of the connection. However, there is no reason to think that, in the same way this error condition appeared, it would not get solved in the near term. Therefore, treating the received ICMP error messages as "soft errors" would make TCP more robust, and could prevent TCP from aborting a TCP connection unnecessarily.
Aborting the connection would be to ignore the valuable feature of the Internet that for many internal failures it reconstructs its function without any disruption of the end points [Clark, 1982].

It is interesting to note that, as ICMP error messages are unreliable, transport protocols should not depend on them for correct functioning. In the event one of these messages was legitimate, the corresponding connection would eventually time out. Also, applications may still be notified asynchronously about the received error messages, and thus may still abort their connections on their own if they consider it appropriate.

This counter-measure has become the "de facto" standard for dealing with the so-called ICMP "hard errors" when they are received for connections in any of the synchronized states.

Delaying the connection reset

An alternative counter-measure would be, in the case of connections in any of the synchronized states, to honour the ICMP error messages only if there is no progress on the connection. Rather than immediately aborting a connection, a TCP would abort a connection only after an ICMP error message indicating a hard error has been received, and the corresponding data have already been retransmitted more than some specified number of times.

The rationale behind this proposed fix is that if a host can make forward progress on a connection, it can completely disregard the "hard errors" being indicated by the received ICMP error messages. However, while this counter-measure could be useful, the one described earlier in this section is easier to implement and provides increased protection against this type of attack.

15.6. Blind throughput-reduction attack

15.6.1. Description

RFC 1122 [Braden, 1989] states that hosts must react to ICMP Source Quench messages by slowing transmission on the connection.
Thus, an attacker could send ICMP Source Quench (type 4, code 0) messages to a TCP endpoint to make it reduce the rate at which it sends data to the other endpoint of the connection. RFC 1122 further adds that the recommended procedure is to put the corresponding connection in the slow-start phase of TCP's congestion control algorithm (described at the time in [Jacobson, 1988], and currently standardized by RFC 2581 [Allman et al, 1999]). In the case of those implementations that use an initial congestion window of one segment, a sustained attack would reduce the throughput of the attacked connection to about SMSS (Sender Maximum Segment Size) bytes per RTT (round-trip time). The throughput achieved during the attack might be higher if a larger initial congestion window is in use, as specified in RFC 3390 [Allman et al, 2002].

15.6.2. Attack-specific countermeasures

RFC 1122 [Braden, 1989] states that hosts must react to ICMP Source Quench messages by slowing transmission on the connection. However, as discussed in RFC 1812 [Baker, 1995], research seems to suggest that ICMP Source Quench is an ineffective (and unfair) antidote for congestion. RFC 1812 further states that routers should not send ICMP Source Quench messages in response to congestion. On the other hand, TCP implements its own congestion control mechanisms [Allman et al, 1999] [Ramakrishnan et al, 2001], which do not depend on ICMP Source Quench messages. Thus, hosts should silently drop ICMP Source Quench messages that are meant for TCP connections.

15.7. Blind performance-degrading attack

15.7.1. Description

When one IP host has a large amount of data to send to another host, the data will be transmitted as a series of IP datagrams.
It is usually preferable that these datagrams be of the largest size that does not require fragmentation anywhere along the path from the source to the destination. This datagram size is referred to as the Path MTU (PMTU), and is equal to the minimum of the MTUs of each hop in the path [Mogul and Deering, 1990].

A technique called "Path MTU Discovery" (PMTUD) lets IP hosts determine the Path MTU of an arbitrary internet path. RFC 1191 [Mogul and Deering, 1990] and RFC 1981 [McCann et al, 1996] specify the PMTUD mechanism for IPv4 and IPv6, respectively.

The PMTUD mechanism for IPv4 uses the Don't Fragment (DF) bit in the IPv4 header to dynamically discover the Path MTU. The basic idea behind the PMTUD mechanism is that a source host assumes that the MTU of the path is that of the first hop, and sends all its datagrams with the DF bit set. If any of the datagrams is too large to be forwarded without fragmentation by some intermediate router, the router will discard the corresponding datagram, and will return an ICMP "Destination Unreachable" (type 3) "fragmentation needed and DF set" (code 4) error message to the sending host. This message will report the MTU of the constricting hop, so that the sending host can reduce the assumed Path-MTU accordingly.

For IPv6, intermediate systems do not fragment packets. Thus, there is an "implicit" DF bit set in every packet sent on an IPv6 network. If any of the datagrams is too large to be forwarded without fragmentation by some intermediate router, the router will discard the corresponding datagram, and will return an ICMPv6 "Packet Too Big" (type 2, code 0) error message to the sending host. This message will report the MTU of the constricting hop, so that the sending host can reduce the assumed Path-MTU accordingly.
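The host-side update rule just described can be sketched as follows. This is a simplified, hypothetical Python model (`update_pmtu()` and `tcp_mss()` are illustrative names): it clamps the reported MTU to the protocol minimum (68 octets for IPv4 per RFC 791, 1280 octets for IPv6 per RFC 2460), never raises the estimate, assumes IP and TCP headers without options unless `tcp_options` is given, and ignores other details of RFC 1191 and RFC 1981.

```python
IPV4_MIN_MTU = 68    # RFC 791: every module must forward 68 octets unfragmented
IPV6_MIN_MTU = 1280  # RFC 2460: minimum IPv6 link MTU


def update_pmtu(current_pmtu: int, reported_mtu: int, ipv6: bool) -> int:
    """New Path-MTU estimate after an ICMP "fragmentation needed and DF set"
    or ICMPv6 "Packet Too Big" message reporting reported_mtu."""
    floor = IPV6_MIN_MTU if ipv6 else IPV4_MIN_MTU
    reported = max(reported_mtu, floor)   # never below the protocol minimum
    return min(current_pmtu, reported)    # the estimate only ever decreases


def tcp_mss(pmtu: int, ipv6: bool, tcp_options: int = 0) -> int:
    """Bytes of TCP payload per segment for a given Path MTU, assuming a
    fixed IP header (20 or 40 bytes) and a 20-byte TCP header plus options."""
    ip_header = 40 if ipv6 else 20
    return max(pmtu - ip_header - 20 - tcp_options, 0)


if __name__ == "__main__":
    for current, reported, v6 in ((1500, 1400, False), (1500, 68, False),
                                  (1500, 576, True)):
        pmtu = update_pmtu(current, reported, v6)
        print(current, reported, "IPv6" if v6 else "IPv4", pmtu, tcp_mss(pmtu, v6))
```

For instance, a report of 68 octets leaves an IPv4 host with 28 bytes of TCP payload per segment when no options are in use.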
As discussed in both RFC 1191 [Mogul and Deering, 1990] and RFC 1981 [McCann et al, 1996], the Path-MTU Discovery mechanism can be used to attack TCP. An attacker could send a forged ICMP "Destination Unreachable, fragmentation needed and DF set" error message (or its ICMPv6 counterpart) to the sending host, advertising a small Next-Hop MTU. As a result, the attacked system would reduce the size of the packets it sends for the corresponding connection accordingly.

The effect of this attack is two-fold. On one hand, it will increase the headers/data ratio, thus increasing the overhead needed to send data to the remote TCP endpoint. On the other hand, if the attacked system wanted to keep the same throughput it was achieving before being attacked, it would have to increase the packet rate. On virtually all systems this will lead to an increase in the IRQ (Interrupt ReQuest) rate, thus increasing processor utilization and degrading the overall system performance. A particular scenario that may take place is that in which an attacker reports a Next-Hop MTU smaller than or equal to the number of bytes needed for headers (IP header plus TCP header). For example, if the attacker reports a Next-Hop MTU of 68 bytes, and the number of bytes used for headers (IPv4 header plus TCP header) is 68 bytes or more, the assumed Path-MTU will not even allow the attacked host to send a single byte of application data without fragmentation. This particular scenario might lead to unpredictable results. Another possible scenario is that in which a TCP connection is being secured by means of IPsec [Kent and Seo, 2005].
If the Next-Hop MTU reported by the attacker is smaller than the number of bytes needed for headers (IP and IPsec, in this case), the assumed Path-MTU will not even allow the attacked host to send a single byte of the TCP header without fragmentation. This is another scenario that might lead to unpredictable results.

For IPv4, the reported Next-Hop MTU could be as low as 68 octets, as RFC 791 [Postel, 1981a] requires every internet module to be able to forward a datagram of 68 octets without further fragmentation. For IPv6, the reported Next-Hop MTU could be as low as 1280 octets (the minimum IPv6 MTU, as specified by RFC 2460 [Deering and Hinden, 1998]).

Recently, the PMTUD WG [PMTUDWG, 2007] of the IETF produced RFC 4821 [Mathis and Heffner, 2007], which specifies a mechanism for discovering the Path-MTU known as "Packetization Layer Path MTU Discovery" (PLPMTUD) that does not rely on ICMP error messages. This mechanism can be implemented either as a replacement for the traditional Path-MTU Discovery mechanism specified in RFC 1191 [Mogul and Deering, 1990] and RFC 1981 [McCann et al, 1996], or only for black-hole detection.

"Black-holes" are caused by routers that discard packets that are too large to be forwarded without fragmentation (and have the IP DF bit set) without sending an ICMP error message to the sending endpoint. An equivalent scenario is that in which the router that discards the packets does send an ICMP error message to the sending endpoint, but some intermediate system (such as a firewall) consistently drops the corresponding ICMP error messages [Lahey, 2000].
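The overhead effect of a forged, small Next-Hop MTU discussed earlier in this section can be quantified with simple arithmetic. The following sketch assumes a bulk transfer of 1,000,000 octets and 20-octet IPv4 and TCP headers without options, and ignores retransmissions and link-layer overhead; the constant and function names are hypothetical.

```python
IPV4_HEADER_MIN = 20  # octets, IPv4 header without options
TCP_HEADER_MIN = 20   # octets, TCP header without options
HEADER_MAX = 60       # octets, upper bound for either header with options


def payload_per_packet(mtu, header_len):
    """Octets of application data that fit in one unfragmented packet."""
    return max(mtu - header_len, 0)


def packets_needed(data_len, mtu, header_len):
    """Packets needed to carry data_len octets, or None if no data fits."""
    payload = payload_per_packet(mtu, header_len)
    if payload == 0:
        return None  # not even one byte of application data fits
    return (data_len + payload - 1) // payload  # ceiling division
```

With 40 octets of headers, sending 1,000,000 octets takes 685 packets at an MTU of 1500 but 35715 packets at a forged Next-Hop MTU of 68, roughly 52 times as many packets (and interrupts) for the same amount of data. If IP and TCP options push the headers to 120 octets, a 68-octet MTU leaves no room for data at all and packets_needed() returns None.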
While replacement of the traditional Path-MTU Discovery mechanism with PLPMTUD would eliminate the attack vector described in this section, the convergence time of PLPMTUD is typically longer than that of the traditional PMTUD mechanism, and thus a number of TCP implementers seem to be unwilling to implement PLPMTUD as a complete replacement for the traditional PMTUD mechanism.

15.7.2. Attack-specific countermeasures

Henceforth, we will refer to both ICMP "fragmentation needed and DF bit set" and ICMPv6 "Packet Too Big" error messages as "ICMP Packet Too Big" error messages.

In addition to the general validation check described in Section 15.4.1, processing of ICMP "Packet Too Big" error messages could be delayed as described in Section 15.5.2, to greatly mitigate the impact of this attack.

This would mean that upon receipt of an ICMP "Packet Too Big" error message, TCP would just record this information, and would honour it only when the corresponding data had already been retransmitted a specified number of times.

While this policy would mitigate the impact of the attack against the PMTUD mechanism, it would also mean that it might take TCP more time to discover the Path-MTU for a TCP connection. This would be particularly undesirable for connections that have just been established, as it might take TCP several transmission attempts (and the corresponding timeouts) before it discovers the PMTU for the corresponding connection. Thus, this policy would increase the time it takes for data to begin to be received at the destination host.

We would like to protect TCP from the attack against the PMTUD mechanism, while still allowing TCP to quickly determine the initial Path-MTU for a connection.
To achieve both goals, we can divide the traditional PMTUD mechanism into two stages: Initial Path-MTU Discovery and Path-MTU Update.

The Initial Path-MTU Discovery stage is when TCP tries to send segments that are larger than the ones that have so far been sent and acknowledged for this connection. That is, in the Initial Path-MTU Discovery stage TCP has no record of these large segments getting to the destination host, and thus it would be fair to believe the network when it reports that these packets are too large to reach the destination host without being fragmented.

The Path-MTU Update stage is when TCP is asked to reduce the size of the segments it sends to a value that is equal to or smaller than that of the largest TCP segment that has so far been sent and acknowledged for this connection. During the Path-MTU Update stage, TCP already has knowledge of the estimated Path-MTU for the given connection. Thus, it would be fair to be more cautious with the errors being reported by the network.

In order to allow TCP to distinguish between segments performing Initial Path-MTU Discovery and those performing Path-MTU Update, two new variables would need to be introduced to TCP: maxsizeacked and maxsizesent.

maxsizesent would hold the size (in octets) of the largest packet that has so far been sent for this connection. It would be initialized to 68 (the minimum IPv4 MTU) when the underlying internet protocol is IPv4, and to 1280 (the minimum IPv6 MTU) when the underlying internet protocol is IPv6. Whenever a packet larger than maxsizesent octets is sent, maxsizesent should be set to that value.

On the other hand, maxsizeacked would hold the size (in octets) of the largest packet that has so far been acknowledged for this connection.
It would be initialized to 68 (the minimum IPv4 MTU) when the underlying internet protocol is IPv4, and to 1280 (the minimum IPv6 MTU) when the underlying internet protocol is IPv6. Whenever an acknowledgement for a packet larger than maxsizeacked octets is received, maxsizeacked should be set to the size of that acknowledged packet.

Upon receipt of an ICMP "Packet Too Big" error message, the Next-Hop MTU claimed by the ICMP message (henceforth "claimedmtu") should be compared with maxsizesent. If claimedmtu is equal to or larger than maxsizesent, then the ICMP error message should be silently discarded. The rationale for this policy is that the ICMP error message cannot be legitimate if it claims to have been elicited by a packet larger than the largest packet we have so far sent for this connection.

If this check is passed, claimedmtu should be compared with maxsizeacked. If claimedmtu is equal to or larger than maxsizeacked, TCP is supposed to be in the Initial Path-MTU Discovery stage, and thus the ICMP "Packet Too Big" error message should be honoured immediately. That is, the assumed Path-MTU should be updated according to the Next-Hop MTU claimed in the ICMP error message. Also, maxsizesent should be reset to the minimum MTU of the internet protocol in use (68 for IPv4, and 1280 for IPv6).

On the other hand, if claimedmtu is smaller than maxsizeacked, TCP is supposed to be in the Path-MTU Update stage. At this stage, TCP should be more cautious with the errors being reported by the network, and should therefore just record the received error message and delay the update of the assumed Path-MTU.

To perform this delay, one new variable and one new parameter should be introduced to TCP: nsegrto and MAXSEGRTO. nsegrto will hold the number of times a specified segment has timed out.
It should be initialized to zero, and should be incremented by one every time the corresponding segment times out. MAXSEGRTO would specify the number of times a given segment must time out before an ICMP "Packet Too Big" error message can be honoured.

Thus, if nsegrto is greater than or equal to MAXSEGRTO, and there is a pending ICMP "Packet Too Big" error message, the corresponding error message should be honoured: the assumed Path-MTU should be updated according to claimedmtu, maxsizeacked should be set to claimedmtu, and maxsizesent should be set to 68 (for IPv4) or 1280 (for IPv6).

If, while there is a pending ICMP "Packet Too Big" error message, the TCP Sequence Number claimed by the pending ICMP error message is acknowledged (i.e., an ACK that acknowledges that sequence number is received), then the "pending error" condition should be cleared.

The rationale behind performing this delayed processing of ICMP "Packet Too Big" error messages is that if there is progress on the connection, the ICMP "Packet Too Big" errors must be a false claim. By checking for progress on the connection, rather than just for staleness (i.e., checking the embedded TCP Sequence Number) of the received ICMP messages, TCP is protected from attack even if the offending ICMP messages are "in window", and, as a corollary, is made more robust to spurious ICMP messages elicited by, for example, corrupted TCP segments.

MAXSEGRTO can be set, in principle, to any value greater than or equal to 0. Setting MAXSEGRTO to 0 would make TCP perform the traditional PMTUD mechanism defined in RFC 1191 [Mogul and Deering, 1990] and RFC 1981 [McCann et al, 1996]. A MAXSEGRTO of 1 should provide enough protection for most scenarios. In any case, implementations are free to choose higher values for this constant.
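The processing rules above can be summarised as a small per-connection state machine. The following is a minimal sketch, not a reference implementation: the class PtbFilter, its method names and string return values, the seq_before() helper, and the simplification of resetting nsegrto on every acknowledgement (rather than tracking one specific segment) are assumptions made for illustration. The names maxsizesent, maxsizeacked, claimedmtu, nsegrto and MAXSEGRTO follow the text.

```python
MIN_MTU = {4: 68, 6: 1280}  # minimum MTU for IPv4 (RFC 791) and IPv6 (RFC 2460)


def seq_before(a, b):
    """True if sequence number a precedes b, using modulo-2**32 arithmetic."""
    return 0 < (b - a) % 2**32 < 2**31


class PtbFilter:
    """Hypothetical per-connection state for delayed "Packet Too Big" handling."""

    def __init__(self, ip_version, assumed_pmtu, maxsegrto=1):
        self.min_mtu = MIN_MTU[ip_version]
        self.maxsizesent = self.min_mtu
        self.maxsizeacked = self.min_mtu
        self.assumed_pmtu = assumed_pmtu
        self.maxsegrto = maxsegrto  # MAXSEGRTO; 0 behaves like classic PMTUD
        self.nsegrto = 0            # timeouts of the oldest unacknowledged segment
        self.pending = None         # (claimedmtu, claimed_seq) awaiting a decision

    def on_send(self, packet_size):
        self.maxsizesent = max(self.maxsizesent, packet_size)

    def on_ack(self, acked_packet_size, ack_seq):
        self.maxsizeacked = max(self.maxsizeacked, acked_packet_size)
        self.nsegrto = 0  # assumption: the segment that was timing out is now acked
        if self.pending is not None and seq_before(self.pending[1], ack_seq):
            self.pending = None  # progress on the connection: the PTB was false

    def on_timeout(self):
        self.nsegrto += 1
        return self._honour_if_due()

    def on_ptb(self, claimedmtu, claimed_seq):
        if claimedmtu >= self.maxsizesent:
            return "discarded"  # no packet we sent could have elicited it
        if claimedmtu >= self.maxsizeacked:
            # Initial Path-MTU Discovery stage: honour immediately.
            self.assumed_pmtu = claimedmtu
            self.maxsizesent = self.min_mtu
            return "honoured"
        # Path-MTU Update stage: record the error and delay the update.
        self.pending = (claimedmtu, claimed_seq)
        return self._honour_if_due() or "pending"

    def _honour_if_due(self):
        if self.pending is None or self.nsegrto < self.maxsegrto:
            return None
        claimedmtu, _ = self.pending
        self.assumed_pmtu = claimedmtu
        self.maxsizeacked = claimedmtu
        self.maxsizesent = self.min_mtu
        self.pending = None
        return "honoured"
```

With maxsegrto set to 1, a forged message claiming an MTU below maxsizeacked stays pending and is dropped as soon as an ACK covers the claimed sequence number; it only takes effect if the segment actually times out. Setting maxsegrto to 0 reproduces the traditional, immediate reaction.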
MAXSEGRTO could also be a function of the Next-Hop MTU claimed in the received ICMP "Packet Too Big" message. That is, higher values for MAXSEGRTO could be imposed when the received ICMP "Packet Too Big" message claims a Next-Hop MTU that is smaller than some specified value.

In the event a higher level of protection is desired at the expense of a higher delay in the discovery of the Path-MTU, an implementation could consider TCP to always be in the Path-MTU Update stage, thus always delaying the update of the assumed Path-MTU.

The current PMTUD mechanism, as specified by RFC 1191 [Mogul and Deering, 1990] and RFC 1981 [McCann et al, 1996], still suffers from some functionality problems described in RFC 2923 [Lahey, 2000] that the proposed countermeasure does not aim to address. A mechanism that addresses those issues is specified in RFC 4821 [Mathis and Heffner, 2007].

[Gont, 2008a] provides further details and analysis of this attack-specific countermeasure.

16. TCP interaction with the Internet Protocol (IP)

16.1. TCP-based traceroute

The traceroute tool is used to identify the intermediate systems between the local system and the destination system. It is usually implemented by sending "probe" packets with increasing IP Time to Live values (starting from 1), without maintaining any state with the final destination.

Some traceroute implementations use ICMP "echo request" messages as the probe packets, while others use UDP packets or TCP SYN segments.

In some cases, the stateless nature of the traceroute tool may prevent it from working correctly across stateful devices such as Network Address Translators (NATs) or firewalls.
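The limitation just described can be illustrated with a toy model of a path. In this hypothetical sketch, each intermediate system decrements the Time to Live and discards the probe (reporting it with an ICMP "Time Exceeded" message) when it reaches zero, while a stateful device silently drops any probe that does not belong to a connection it is tracking. For simplicity, the model assumes that the stateful device filters before processing the Time to Live and that ICMP error messages are never blocked; real devices vary.

```python
def probe(path, ttl, in_tracked_connection=False):
    """Return where a probe sent with the given Time to Live stops.

    path is a list of (name, stateful) tuples describing the intermediate
    systems. The result is the name of the hop at which the Time to Live
    expires, "dropped" if a stateful device discards the probe, or
    "destination" if the probe traverses the whole path.
    """
    for name, stateful in path:
        if stateful and not in_tracked_connection:
            return "dropped"  # no state for this probe: silently discarded
        ttl -= 1
        if ttl == 0:
            return name  # this hop would send ICMP "Time Exceeded"
    return "destination"


def trace(path, in_tracked_connection=False, max_ttl=30):
    """Send probes with Time to Live 1, 2, ... and collect where each stops."""
    results = []
    for ttl in range(1, max_ttl + 1):
        result = probe(path, ttl, in_tracked_connection)
        results.append(result)
        if result in ("dropped", "destination"):
            break
    return results
```

For a path [("r1", False), ("nat", True), ("r2", False)], stateless probes reveal only r1 before being dropped, whereas probes that belong to an already-established connection reveal r1, nat and r2.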
In order to bypass this limitation, an attacker could establish a TCP connection with the destination system, and start sending TCP segments on that connection with increasing IP Time to Live values (starting from 1) [Zalewski, 2007] [Zalewski, 2008]. Provided ICMP error messages are not blocked by any intermediate system, an attacker could exploit this technique to map the network topology behind the aforementioned stateful devices in scenarios in which he could not have achieved this goal using the traditional traceroute tool.

NATs [Srisuresh and Egevang, 2001] and other middle-boxes could defeat this network-mapping technique by overwriting the Time to Live of the packets they forward to the internal network, so that probes no longer expire at (and thus no longer reveal) the internal hops. For example, they could overwrite the Time to Live of all packets being forwarded to an internal network with a value such as 128. We strongly recommend against overwriting the IP Time to Live field with the value 255 or other similar large values, as this could allow an attacker to bypass the protection provided by the Generalized TTL Security Mechanism (GTSM) described in RFC 5082 [Gill et al, 2007], since packets from distant systems would then appear to originate on-link.

[Gont and Srisuresh, 2008] discusses the security implications of NATs, and proposes mitigations for this and other issues.

16.2. Blind TCP data injection through fragmented IP traffic

As discussed in Section 11.2, TCP data injection attacks usually require an attacker to guess or know a number of parameters related to the target TCP connection, such as the connection-id {Source Address, Source Port, Destination Address, Destination Port}, the TCP Sequence Number, and the TCP Acknowledgement Number.
Provided these values are obfuscated as recommended in this document, the chances of an off-path attacker successfully performing a data injection attack against a TCP connection are fairly low for many of the most common scenarios.

As discussed in this document, randomization of the values contained in different TCP header fields is not a replacement for cryptographic methods for protecting a TCP connection, such as IPsec (specified in RFC 4301 [Kent and Seo, 2005]).

However, [Zalewski, 2003b] describes a possible vector for performing a TCP data injection attack that does not require the attacker to guess or know the aforementioned TCP connection parameters, and could therefore be successfully exploited in some scenarios with less effort than that required to exploit the more traditional data-injection attack vectors.

The attack vector works as follows. When one system is transferring information to a remote peer by means of TCP, and the resulting packet gets fragmented, the first fragment will usually contain the entire TCP header which, together with the IP header, includes all the connection parameters that an attacker would need to guess or know to successfully perform a data injection attack against TCP. If an attacker were able to forge all the fragments other than the first one, his forged fragments could be reassembled together with the legitimate first fragment, and thus he would be relieved from the hard task of guessing or knowing connection parameters such as the TCP Sequence Number and the TCP Acknowledgement Number.

In order to successfully exploit this attack vector, the attacker should be able to guess or know both of the IP addresses involved in the target TCP connection, the IP Identification value used for the specific packet he is targeting, and the TCP Checksum of that target packet.
While it would seem that these values are hard to guess, in some specific scenarios, and with some implementation approaches for the TCP and IP protocols that are unwise from a security standpoint, these values may be feasible to guess or know. For example, if the sending system uses predictable IP Identification values, the attacker could simply perform a brute-force attack, trying each of the possible values of the TCP Checksum field. In more specific scenarios, the attacker could have more detailed knowledge about the data being transferred over the target TCP connection, which might allow him to predict the TCP Checksum of the target packet. For example, if both of the involved TCP peers used predictable values for the TCP Sequence Number and for the IP Identification fields, and the attacker knew the data being transferred over the target TCP connection, he might be able to carefully forge the IP payload of his IP fragments so that the checksum of the reassembled TCP segment matched the Checksum included in the TCP header of the first (and legitimate) IP fragment.

As discussed in Section 4.1 of [CPNI, 2008], IP fragmentation provides a vector for performing a variety of attacks against an IP implementation. Therefore, we discourage the reliance on IP fragmentation by end-systems, and recommend the implementation of mechanisms for the discovery of the Path-MTU, such as that described in Section 15.7.2 of this document and/or that described in RFC 4821 [Mathis and Heffner, 2007]. We nevertheless recommend randomization of the IP Identification field as described in Section 3.5.2 of [CPNI, 2008]. While randomization of the IP Identification field does not eliminate this attack vector, it does require more work on the side of the attacker to successfully exploit it.

16.3. Broadcast and multicast IP addresses

TCP connection state is maintained between only two endpoints at a time. As a result, broadcast and multicast IP addresses should not be allowed for the establishment of TCP connections. Section 4.3 of [CPNI, 2008] provides advice about which specific IP address blocks should not be allowed for connection-oriented protocols such as TCP.

17. Security Considerations

18. Acknowledgements

This document is heavily based on the document "Security Assessment of the Transmission Control Protocol (TCP)" [CPNI, 2009] written by Fernando Gont on behalf of CPNI (Centre for the Protection of National Infrastructure).

The author would like to thank (in alphabetical order) Randall Atkinson, Guillermo Gont, Alfred Hoenes, Jamshid Mahdavi, Stanislav Shalunov, Michael Welzl, Dan Wing, Andrew Yourtchenko, Michael Zalewski, and Christos Zoulas, for providing valuable feedback on earlier versions of the UK CPNI document.

Additionally, the author would like to thank (in alphabetical order) Mark Allman, David Black, Ethan Blanton, David Borman, James Chacon, John Heffner, Jerrold Leichter, Jamshid Mahdavi, Keith Scott, Bill Squier, and David White, who generously answered a number of questions that arose while the aforementioned document was being written.

Finally, the author would like to thank CPNI (formerly NISCC) for their continued support.

19. References

Abley, J., Savola, P., Neville-Neil, G. 2007. Deprecation of Type 0 Routing Headers in IPv6. RFC 5095.

Allman, M. 2003. TCP Congestion Control with Appropriate Byte Counting (ABC). RFC 3465.

Allman, M. 2008. Comments On Selecting Ephemeral Ports. Available at: http://www.icir.org/mallman/share/ports-dec08.pdf

Allman, M., Paxson, V., Stevens, W. 1999. TCP Congestion Control. RFC 2581.

Allman, M., Balakrishnan, H., Floyd, S.
2001. Enhancing TCP's Loss Recovery Using Limited Transmit. RFC 3042.

Allman, M., Floyd, S., and C. Partridge. 2002. Increasing TCP's Initial Window. RFC 3390.

Baker, F. 1995. Requirements for IP Version 4 Routers. RFC 1812.

Baker, F., Savola, P. 2004. Ingress Filtering for Multihomed Networks. RFC 3704.

Barisani, A. 2006. FTester - Firewall and IDS testing tool. Available at: http://dev.inversepath.com/trac/ftester

Beck, R. 2001. Passive-Aggressive Resistance: OS Fingerprint Evasion. Linux Journal.

Bellovin, S. M. 1989. Security Problems in the TCP/IP Protocol Suite. Computer Communication Review, Vol. 19, No. 2, pp. 32-48.

Bellovin, S. M. 1996. Defending Against Sequence Number Attacks. RFC 1948.

Bellovin, S. M. 2006. Towards a TCP Security Option. IETF Internet-Draft (draft-bellovin-tcpsec-00.txt), work in progress.

Bernstein, D. J. 1996. SYN cookies. Available at: http://cr.yp.to/syncookies.html

Blake, S., Black, D., Carlson, M., Davies, E., Wang, Z., and Weiss, W. 1998. An Architecture for Differentiated Services. RFC 2475.

Blanton, E., Allman, M., Fall, K., Wang, L. 2003. A Conservative Selective Acknowledgment (SACK)-based Loss Recovery Algorithm for TCP. RFC 3517.

Borman, D. 1997. Post to the tcp-impl mailing-list. Message-Id: <199706061526.KAA01535@frantic.BSDI.COM>. Available at: http://www.kohala.com/start/borman.97jun06.txt

Borman, D., Deering, S., Hinden, R. 1999. IPv6 Jumbograms. RFC 2675.

Braden, R. 1989. Requirements for Internet Hosts -- Communication Layers. RFC 1122.

Braden, R. 1992. Extending TCP for Transactions -- Concepts. RFC 1379.

Braden, R. 1994. T/TCP -- TCP Extensions for Transactions Functional Specification. RFC 1644.

CCSDS. 2006.
Consultative Committee for Space Data Systems (CCSDS) Recommendation Communications Protocol Specification (SCPS) -- Transport Protocol (SCPS-TP). Blue Book. Issue 2. Available at: http://public.ccsds.org/publications/archive/714x0b2.pdf

CERT. 1996. CERT Advisory CA-1996-21: TCP SYN Flooding and IP Spoofing Attacks. Available at: http://www.cert.org/advisories/CA-1996-21.html

CERT. 1997. CERT Advisory CA-1997-28: IP Denial-of-Service Attacks. Available at: http://www.cert.org/advisories/CA-1997-28.html

CERT. 2000. CERT Advisory CA-2000-21: Denial-of-Service Vulnerabilities in TCP/IP Stacks. Available at: http://www.cert.org/advisories/CA-2000-21.html

CERT. 2001. CERT Advisory CA-2001-09: Statistical Weaknesses in TCP/IP Initial Sequence Numbers. Available at: http://www.cert.org/advisories/CA-2001-09.html

CERT. 2003. CERT Advisory CA-2003-13: Multiple Vulnerabilities in Snort Preprocessors. Available at: http://www.cert.org/advisories/CA-2003-13.html

Cisco. 2008a. Cisco Security Appliance Command Reference, Version 7.0. Available at: http://www.cisco.com/en/US/docs/security/asa/asa70/command/reference/tz.html#wp1288756

Cisco. 2008b. Cisco Security Appliance System Log Messages, Version 8.0. Available at: http://www.cisco.com/en/US/docs/security/asa/asa80/system/message/logmsgs.html#wp4773952

Clark, D.D. 1982. Fault isolation and recovery. RFC 816.

Clark, D.D. 1988. The Design Philosophy of the DARPA Internet Protocols. Computer Communication Review, Vol. 18, No. 4, pp. 106-114.

Connolly, T., Amer, P., Conrad, P. 1994. An Extension to TCP : Partial Order Service. RFC 1693.

Conta, A., Deering, S., Gupta, M. 2006. Internet Control Message Protocol (ICMPv6) for the Internet Protocol Version 6 (IPv6) Specification. RFC 4443.

CORE. 2003.
Core Secure Technologies Advisory CORE-2003-0307: Snort TCP Stream Reassembly Integer Overflow Vulnerability. Available at: http://www.coresecurity.com/common/showdoc.php?idx=313&idxseccion=10

CPNI. 2008. Security Assessment of the Internet Protocol. Available at: http://www.cpni.gov.uk/Docs/InternetProtocol.pdf

CPNI. 2009. Security Assessment of the Transmission Control Protocol (TCP). Available at: http://www.cpni.gov.uk/Docs/tn-03-09-security-assessment-TCP.pdf

daemon9, route, and infinity. 1996. IP-spoofing Demystified (Trust-Relationship Exploitation). Phrack Magazine, Volume Seven, Issue Forty-Eight, File 14 of 18. Available at: http://www.phrack.org/archives/48/P48-14

Deering, S., Hinden, R. 1998. Internet Protocol, Version 6 (IPv6) Specification. RFC 2460.

Dharmapurikar, S., Paxson, V. 2005. Robust TCP Stream Reassembly In the Presence of Adversaries. Proceedings of the USENIX Security Symposium 2005.

Duke, M., Braden, R., Eddy, W., Blanton, E. 2006. A Roadmap for Transmission Control Protocol (TCP) Specification Documents. RFC 4614.

Ed3f. 2002. Firewall spotting and networks analisys with a broken CRC. Phrack Magazine, Volume 0x0b, Issue 0x3c, Phile #0x0c of 0x10. Available at: http://www.phrack.org/phrack/60/p60-0x0c.txt

Eddy, W. 2007. TCP SYN Flooding Attacks and Common Mitigations. RFC 4987.

Fenner, B. 2006. Experimental Values in IPv4, IPv6, ICMPv4, ICMPv6, UDP, and TCP Headers. RFC 4727.

Ferguson, P., and Senie, D. 2000. Network Ingress Filtering: Defeating Denial of Service Attacks which employ IP Source Address Spoofing. RFC 2827.

Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L., Leach, P., and Berners-Lee, T. 1999. Hypertext Transfer Protocol -- HTTP/1.1. RFC 2616.

Floyd, S., Mahdavi, J., Mathis, M., Podolsky, M. 2000.
An Extension to the Selective Acknowledgement (SACK) Option for TCP. RFC 2883.

Floyd, S., Henderson, T., Gurtov, A. 2004. The NewReno Modification to TCP's Fast Recovery Algorithm. RFC 3782.

Floyd, S., Allman, M., Jain, A., Sarolahti, P. 2007. Quick-Start for TCP and IP. RFC 4782.

Fyodor. 1998. Remote OS Detection via TCP/IP Stack Fingerprinting. Phrack Magazine, Volume 8, Issue 54.

Fyodor. 2006a. Remote OS Detection via TCP/IP Fingerprinting (2nd Generation). Available at: http://insecure.org/nmap/osdetect/

Fyodor. 2006b. Nmap - Free Security Scanner For Network Exploration and Audit. Available at: http://www.insecure.org/nmap

Fyodor. 2008. Nmap Reference Guide: Port Scanning Techniques. Available at: http://nmap.org/book/man-port-scanning-techniques.html

GIAC. 2000. Egress Filtering v 0.2. Available at: http://www.sans.org/y2k/egress.htm

Giffin, J., Greenstadt, R., Litwack, P., Tibbetts, R. 2002. Covert Messaging through TCP Timestamps. PET2002 (Workshop on Privacy Enhancing Technologies), San Francisco, CA, USA, April 2002. Available at: http://web.mit.edu/greenie/Public/CovertMessaginginTCP.ps

Gill, V., Heasley, J., Meyer, D., Savola, P., Pignataro, C. 2007. The Generalized TTL Security Mechanism (GTSM). RFC 5082.

Gont, F. 2006. Advanced ICMP packet filtering. Available at: http://www.gont.com.ar/papers/icmp-filtering.html

Gont, F. 2008a. ICMP attacks against TCP. IETF Internet-Draft (draft-ietf-tcpm-icmp-attacks-04.txt), work in progress.

Gont, F. 2008b. TCP's Reaction to Soft Errors. IETF Internet-Draft (draft-ietf-tcpm-tcp-soft-errors-09.txt), work in progress.

Gont, F. 2009. On the generation of TCP timestamps. IETF Internet-Draft (draft-gont-tcpm-tcp-timestamps-01.txt), work in progress.

Gont, F., Srisuresh, P. 2008. Security Implications of Network Address Translators (NATs).
IETF Internet-Draft (draft-gont-behave-nat-security-01.txt), work in progress.

Gont, F., Yourtchenko, A. 2009. On the implementation of TCP urgent data. IETF Internet-Draft (draft-gont-tcpm-urgent-data-01.txt), work in progress.

Heffernan, A. 1998. Protection of BGP Sessions via the TCP MD5 Signature Option. RFC 2385.

Heffner, J. 2002. High Bandwidth TCP Queuing. Senior Thesis.

Hoenes, A. 2007. TCP options - tcp-parameters IANA registry. Post to the tcpm wg mailing-list. Available at: http://www.ietf.org/mail-archive/web/tcpm/current/msg03199.html

IANA. 2007. Transmission Control Protocol (TCP) Option Numbers. Available at: http://www.iana.org/assignments/tcp-parameters/

IANA. 2008. Port Numbers. Available at: http://www.iana.org/assignments/port-numbers

Jacobson, V. 1988. Congestion Avoidance and Control. Computer Communication Review, Vol. 18, No. 4, pp. 314-329. Available at: ftp://ftp.ee.lbl.gov/papers/congavoid.ps.Z

Jacobson, V., Braden, R. 1988. TCP Extensions for Long-Delay Paths. RFC 1072.

Jacobson, V., Braden, R., Borman, D. 1992. TCP Extensions for High Performance. RFC 1323.

Jones, S. 2003. Port 0 OS Fingerprinting. Available at: http://www.gont.com.ar/docs/port-0-os-fingerprinting.txt

Kent, S. and Seo, K. 2005. Security Architecture for the Internet Protocol. RFC 4301.

Klensin, J. 2008. Simple Mail Transfer Protocol. RFC 5321.

Ko, Y., Ko, S., and Ko, M. 2001. NIDS Evasion Method named SeolMa. Phrack Magazine, Volume 0x0b, Issue 0x39, phile #0x03 of 0x12. Available at: http://www.phrack.org/issues.html?issue=57&id=3#article

Lahey, K. 2000. TCP Problems with Path MTU Discovery. RFC 2923.

Larsen, M., Gont, F. 2008. Port Randomization. IETF Internet-Draft (draft-ietf-tsvwg-port-randomization-02), work in progress.

Lemon, J. 2002. Resisting SYN flood DoS attacks with a SYN cache.
Proceedings of the BSDCon 2002 Conference, pp. 89-98.

Maimon, U. 1996. Port Scanning without the SYN flag. Phrack Magazine, Volume Seven, Issue Forty-Nine, phile #0x0f of 0x10. Available at: http://www.phrack.org/issues.html?issue=49&id=15#article

Mathis, M., Mahdavi, J., Floyd, S., Romanow, A. 1996. TCP Selective Acknowledgment Options. RFC 2018.

Mathis, M., and Heffner, J. 2007. Packetization Layer Path MTU Discovery. RFC 4821.

McCann, J., Deering, S., Mogul, J. 1996. Path MTU Discovery for IP version 6. RFC 1981.

McKusick, M., Bostic, K., Karels, M., and J. Quarterman. 1996. The Design and Implementation of the 4.4BSD Operating System. Addison-Wesley.

Meltman. 1997. new TCP/IP bug in win95. Post to the bugtraq mailing-list. Available at: http://insecure.org/sploits/land.ip.DOS.html

Miller, T. 2006. Passive OS Fingerprinting: Details and Techniques. Available at: http://www.ouah.org/incosfingerp.htm

Mogul, J., and Deering, S. 1990. Path MTU Discovery. RFC 1191.

Morris, R. 1985. A Weakness in the 4.2BSD Unix TCP/IP Software. Technical Report CSTR-117, AT&T Bell Laboratories. Available at: http://pdos.csail.mit.edu/~rtm/papers/117.pdf

Myst. 1997. Windows 95/NT DoS. Post to the bugtraq mailing-list. Available at: http://seclists.org/bugtraq/1997/May/0039.html

Nichols, K., Blake, S., Baker, F., and Black, D. 1998. Definition of the Differentiated Services Field (DS Field) in the IPv4 and IPv6 Headers. RFC 2474.

NISCC. 2004. NISCC Vulnerability Advisory 236929: Vulnerability Issues in TCP. Available at: http://www.uniras.gov.uk/niscc/docs/re-20040420-00391.pdf

NISCC. 2005. NISCC Vulnerability Advisory 532967/NISCC/ICMP: Vulnerability Issues in ICMP packets with TCP payloads. Available at: http://www.niscc.gov.uk/niscc/docs/re-20050412-00303.pdf

NISCC. 2006.
NISCC Technical Note 01/2006: Egress and Ingress Filtering. Available at: http://www.niscc.gov.uk/niscc/docs/re-20060420-00294.pdf?lang=en

Ostermann, S. 2008. tcptrace tool. Tool and documentation available at: http://www.tcptrace.org

Paxson, V., Allman, M. 2000. Computing TCP's Retransmission Timer. RFC 2988.

PCNWG. 2009. Congestion and Pre-Congestion Notification (pcn) charter. Available at: http://www.ietf.org/html.charters/pcn-charter.html

PMTUDWG. 2007. Path MTU Discovery (pmtud) charter. Available at: http://www.ietf.org/html.charters/OLD/pmtud-charter.html

Postel, J. 1981a. Internet Protocol. DARPA Internet Program. Protocol Specification. RFC 791.

Postel, J. 1981b. Internet Control Message Protocol. RFC 792.

Postel, J. 1981c. Transmission Control Protocol. DARPA Internet Program. Protocol Specification. RFC 793.

Postel, J. 1987. TCP AND IP BAKE OFF. RFC 1025.

Ptacek, T. H., and Newsham, T. N. 1998. Insertion, Evasion and Denial of Service: Eluding Network Intrusion Detection. Secure Networks, Inc. Available at: http://www.aciri.org/vern/Ptacek-Newsham-Evasion-98.ps

Ramaiah, A., Stewart, R., and Dalal, M. 2008. Improving TCP's Robustness to Blind In-Window Attacks. IETF Internet-Draft (draft-ietf-tcpm-tcpsecure-10.txt), work in progress.

Ramakrishnan, K., Floyd, S., and Black, D. 2001. The Addition of Explicit Congestion Notification (ECN) to IP. RFC 3168.

Rekhter, Y., Li, T., Hares, S. 2006. A Border Gateway Protocol 4 (BGP-4). RFC 4271.

Rivest, R. 1992. The MD5 Message-Digest Algorithm. RFC 1321.

Rowland, C. 1997. Covert Channels in the TCP/IP Protocol Suite. First Monday Journal, Volume 2, Number 5. Available at: http://www.firstmonday.org/issues/issue2_5/rowland/

Savage, S., Cardwell, N., Wetherall, D., Anderson, T. 1999. TCP Congestion Control with a Misbehaving Receiver.
ACM Computer Communication Review, 29(5), October 1999.

Semke, J., Mahdavi, J., Mathis, M. 1998. Automatic TCP Buffer
Tuning. ACM Computer Communication Review, Vol. 28, No. 4.

Shalunov, S. 2000. Netkill. Available at:
http://www.internet2.edu/~shalunov/netkill/netkill.html

Shimomura, T. 1995. Technical details of the attack described by
Markoff in NYT. Message posted in USENET's comp.security.misc
newsgroup, Message-ID: <3g5gkl$5j1@ariel.sdsc.edu>. Available at:
http://www.gont.com.ar/docs/post-shimomura-usenet.txt.

Silbersack, M. 2005. Improving TCP/IP security through randomization
without sacrificing interoperability. EuroBSDCon 2005 Conference.

SinFP. 2006. Net::SinFP - a Perl module to do OS fingerprinting.
Available at:
http://www.gomor.org/cgi-bin/index.pl?mode=view;page=sinfp

Smart, M., Malan, G., Jahanian, F. 2000. Defeating TCP/IP Stack
Fingerprinting. Proceedings of the 9th USENIX Security Symposium,
pp. 229-240. Available at:
http://www.usenix.org/publications/library/proceedings/sec2000/full_papers/smart/smart_html/index.html

Smith, C., Grundl, P. 2002. Know Your Enemy: Passive Fingerprinting.
The Honeynet Project.

Spring, N., Wetherall, D., Ely, D. 2003. Robust Explicit Congestion
Notification (ECN) Signaling with Nonces. RFC 3540.

Srisuresh, P., Egevang, K. 2001. Traditional IP Network Address
Translator (Traditional NAT). RFC 3022.

Stevens, W. R. 1994. TCP/IP Illustrated, Volume 1: The Protocols.
Addison-Wesley Professional Computing Series.

TBIT. 2001. TBIT, the TCP Behavior Inference Tool. Available at:
http://www.icir.org/tbit/

Touch, J. 2007. Defending TCP Against Spoofing Attacks. RFC 4953.

US-CERT. 2001. US-CERT Vulnerability Note VU#498440: Multiple TCP/IP
implementations may use statistically predictable initial sequence
numbers.
Available at: http://www.kb.cert.org/vuls/id/498440

US-CERT. 2003a. US-CERT Vulnerability Note VU#26825: Cisco Secure
PIX Firewall TCP Reset Vulnerability. Available at:
http://www.kb.cert.org/vuls/id/26825

US-CERT. 2003b. US-CERT Vulnerability Note VU#464113: TCP/IP
implementations handle unusual flag combinations inconsistently.
Available at: http://www.kb.cert.org/vuls/id/464113

US-CERT. 2004a. US-CERT Vulnerability Note VU#395670: FreeBSD fails
to limit number of TCP segments held in reassembly queue. Available
at: http://www.kb.cert.org/vuls/id/395670

US-CERT. 2005a. US-CERT Vulnerability Note VU#102014: Optimistic TCP
acknowledgements can cause denial of service. Available at:
http://www.kb.cert.org/vuls/id/102014

US-CERT. 2005b. US-CERT Vulnerability Note VU#396645: Microsoft
Windows vulnerable to DoS via LAND attack. Available at:
http://www.kb.cert.org/vuls/id/396645

US-CERT. 2005c. US-CERT Vulnerability Note VU#637934: TCP does not
adequately validate segments before updating timestamp value.
Available at: http://www.kb.cert.org/vuls/id/637934

US-CERT. 2005d. US-CERT Vulnerability Note VU#853540: Cisco PIX
fails to verify TCP checksum. Available at:
http://www.kb.cert.org/vuls/id/853540.

Veysset, F., Courtay, O., Heen, O. 2002. New Tool And Technique For
Remote Operating System Fingerprinting. Intranode Research Team.

Watson, P. 2004. Slipping in the Window: TCP Reset Attacks.
CanSecWest 2004 Conference.

Welzl, M. 2008. Internet congestion control: evolution and current
open issues. CAIA guest talk, Swinburne University, Melbourne,
Australia. Available at:
http://www.welzl.at/research/publications/caia-jan08.pdf

Wright, G. and W. Stevens. 1994. TCP/IP Illustrated, Volume 2: The
Implementation. Addison-Wesley.

Zalewski, M. 2001a. Strange Attractors and TCP/IP Sequence Number
Analysis.
Available at:
http://lcamtuf.coredump.cx/oldtcp/tcpseq.html

Zalewski, M. 2001b. Delivering Signals for Fun and Profit.
Available at: http://lcamtuf.coredump.cx/signals.txt

Zalewski, M. 2002. Strange Attractors and TCP/IP Sequence Number
Analysis - One Year Later. Available at:
http://lcamtuf.coredump.cx/newtcp/

Zalewski, M. 2003a. Windows URG mystery solved! Post to the bugtraq
mailing-list. Available at:
http://lcamtuf.coredump.cx/p0f-help/p0f/doc/win-memleak.txt

Zalewski, M. 2003b. A new TCP/IP blind data injection technique?
Post to the bugtraq mailing-list. Available at:
http://lcamtuf.coredump.cx/ipfrag.txt

Zalewski, M. 2006a. p0f passive fingerprinting tool. Available at:
http://lcamtuf.coredump.cx/p0f.shtml

Zalewski, M. 2006b. p0f - RST+ signatures. Available at:
http://lcamtuf.coredump.cx/p0f-help/p0f/p0fr.fp

Zalewski, M. 2007. 0trace - traceroute on established connections.
Post to the bugtraq mailing-list. Available at:
http://seclists.org/bugtraq/2007/Jan/0176.html

Zalewski, M. 2008. Museum of broken packets. Available at:
http://lcamtuf.coredump.cx/mobp/

Zander, S. 2008. Covert Channels in Computer Networks. Available
at: http://caia.swin.edu.au/cv/szander/cc/index.html

Zuquete, A. 2002. Improving the functionality of SYN cookies. 6th
IFIP Communications and Multimedia Security Conference (CMS 2002).
Available at: http://www.ieeta.pt/~avz/pubs/CMS02.html

Zweig, J., Partridge, C. 1990. TCP Alternate Checksum Options.
RFC 1146.

Appendix A. TODO list

A number of formatting issues still have to be fixed in this
document, among them:

o  The ASCII art for some figures is still missing. We still have to convert the nice JPGs of the UK CPNI document into ugly ASCII art.

o  The references have not yet been converted to XML and are hardcoded instead, which is why they may not look as expected.

Appendix B. Advice and guidance to vendors

Vendors are urged to contact CSIRTUK (csirt@cpni.gsi.gov.uk) if they
think they may be affected by the issues described in this document.
As the lead coordination center for these issues, CPNI is well placed
to give advice and guidance as required.

CPNI works extensively with government departments and agencies,
commercial organizations and the academic community to research
vulnerabilities and potential threats to IT systems, especially where
they may have an impact on the Critical National Infrastructure (CNI).

Other ways to contact CPNI, plus CPNI's PGP public key, are available
at http://www.cpni.gov.uk/ .

Author's Address

Fernando Gont
UK Centre for the Protection of National Infrastructure

Email: fernando@gont.com.ar
URI: http://www.cpni.gov.uk