idnits 2.17.1

draft-ietf-v6ops-pmtud-ecmp-problem-05.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does
     not match the current year

  -- The document date (October 18, 2015) is 3112 days in the past.  Is
     this intentional?

  -- Found something which looks like a code comment -- if you have code
     sections in the document, please surround them with '<CODE BEGINS>'
     and '<CODE ENDS>' lines.

  Checking references for intended status: Informational
  ----------------------------------------------------------------------------

  == Missing Reference: 'Ether' is mentioned on line 238, but not defined

  -- Obsolete informational reference (is this intentional?): RFC 1981
     (Obsoleted by RFC 8201)

     Summary: 0 errors (**), 0 flaws (~~), 2 warnings (==), 3 comments (--).

     Run idnits with the --verbose option for more detailed information
     about the items above.

--------------------------------------------------------------------------------

v6ops                                                          M. Byerly
Internet-Draft                                                    Fastly
Intended status: Informational                                   M. Hite
Expires: April 20, 2016                                         Evernote
                                                              J. Jaeggli
                                                                  Fastly
                                                        October 18, 2015


 Close encounters of the ICMP type 2 kind (near misses with ICMPv6 PTB)
                draft-ietf-v6ops-pmtud-ecmp-problem-05

Abstract

   This document calls attention to the problem of delivering ICMPv6
   type 2 "Packet Too Big" (PTB) messages to the intended destination
   (typically the server) in ECMP load-balanced or anycast network
   architectures.  It discusses operational mitigations that can be
   employed to address this class of failures.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on April 20, 2016.

Copyright Notice

   Copyright (c) 2015 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.
   Code Components extracted from this document must include Simplified
   BSD License text as described in Section 4.e of the Trust Legal
   Provisions and are provided without warranty as described in the
   Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Problem
   3.  Mitigation
     3.1.  Alternative Mitigations
     3.2.  Implementation
       3.2.1.  Alternative Implementation
   4.  Improvements
   5.  Acknowledgements
   6.  IANA Considerations
   7.  Security Considerations
   8.  Informative References
   Authors' Addresses

1.  Introduction

   Operators of popular Internet services face complex challenges
   associated with scaling their infrastructure.  One scaling approach
   is to utilize equal-cost multipath (ECMP) routing to perform
   stateless distribution of incoming TCP or UDP sessions to multiple
   servers or to middleboxes such as load balancers.  Distribution of
   traffic in this manner presents a problem when dealing with ICMP
   signaling.  Specifically, an ICMP error is not guaranteed to hash
   via ECMP to the same destination as its corresponding TCP or UDP
   session.  A case where this is particularly problematic
   operationally is Path MTU Discovery (PMTUD) [RFC1981].

2.  Problem

   A common application for stateless load balancing of TCP or UDP
   flows is to perform an initial subdivision of flows in front of a
   stateful load balancer tier or multiple servers so that the workload
   becomes divided into manageable fractions of the total number of
   flows.  The flow division is performed using ECMP forwarding and a
   stateless but sticky algorithm for hashing across the available
   paths (see RFC 2991 [RFC2991] for background on ECMP routing).  This
   next-hop selection for the purposes of flow distribution is a
   constrained form of anycast topology, where all anycast destinations
   are equidistant from the upstream router responsible for making the
   last next-hop forwarding decision before the flow arrives on the
   destination device.  In this approach, the hash is performed across
   some set of available protocol headers.  Typically, these headers
   may include all or a subset of the (IPv6) Flow Label, IP source, IP
   destination, protocol, source port, destination port, and
   potentially others such as ingress interface.

   A problem common to this approach of distribution through hashing is
   its impact on Path MTU Discovery.  An ICMPv6 type 2 PTB message
   generated on an intermediate device, for a packet sent from a server
   that is part of an ECMP load-balanced service to a client, will have
   the load-balanced anycast address as its destination and hence will
   be statelessly load balanced to one of the servers.  While the
   ICMPv6 PTB message contains as much of the packet that could not be
   forwarded as possible, the payload headers are not considered in the
   forwarding decision and are ignored.  Because the PTB message is not
   identifiable as part of the original flow by its IP or upper-layer
   packet headers, the ICMPv6 message is unlikely to be hashed to the
   same next hop as packets matching the TCP or UDP ECMP hash of the
   flow.
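
   As a purely illustrative aid (this sketch is not part of any
   deployed implementation described in this document; the hash
   function, next-hop names, and addresses are arbitrary stand-ins for
   vendor-specific behavior), the following Python fragment shows why
   the mismatch occurs: the outer headers of the PTB (router source
   address, ICMPv6 protocol, no transport ports) form a different hash
   key than the TCP flow to which the PTB refers.

   #!/usr/bin/python
   # Illustrative only: a stateless "ECMP-like" hash over outer packet
   # headers.  Real routers use vendor-specific hash functions and
   # field sets; zlib.crc32 is a stand-in.

   import zlib

   NEXTHOPS = ["lb-1", "lb-2", "lb-3", "lb-4"]

   def ecmp_nexthop(src, dst, proto, sport=0, dport=0):
       # Hash only the outer headers, as a stateless router would.
       key = "%s|%s|%d|%d|%d" % (src, dst, proto, sport, dport)
       return NEXTHOPS[zlib.crc32(key.encode()) % len(NEXTHOPS)]

   # A client TCP flow toward the anycast service address.
   flow_nh = ecmp_nexthop("2001:db8:1::10", "2001:db8:100::80",
                          6, 53171, 443)

   # An ICMPv6 PTB about that flow: sourced by an intermediate router,
   # protocol 58 (ICMPv6), no transport ports visible to the hash.
   ptb_nh = ecmp_nexthop("2001:db8:2::1", "2001:db8:100::80", 58)

   print("flow -> %s, PTB -> %s" % (flow_nh, ptb_nh))

   Because the two hash keys differ, the PTB will in general be
   delivered to a different server behind the ECMP hop than the one
   terminating the flow.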

   An example packet flow and topology follow.  The packet for which
   the PTB message was generated was intended for the client.

   ptb -> router ecmp -> nexthop L4/L7 load balancer -> destination

   router --> load balancer 1 --->
         \--> load balancer 2 --->  load-balanced service
          \--> load balancer N --->

                                Figure 1

   The router ECMP decision is used because it is part of the
   forwarding architecture, can be performed at line rate, and does not
   depend on shared state or coordination across a distributed
   forwarding system, which may include multiple linecards or routers.
   The ECMP routing decision is deterministic with respect to packets
   having the same computed hash.

   A typical case where ICMPv6 PTB messages are received at the load
   balancer is one where the path MTU from the client to the load
   balancer is limited by a tunnel of which the client itself is not
   aware.

   Direct experience says that the frequency of PTB messages is small
   compared to total flows.  One possible conclusion is that tunneled
   IPv6 deployments that cannot carry 1500-byte MTU packets are
   relatively rare.  Techniques employed by clients such as happy
   eyeballs may actually contribute some amelioration to the IPv6
   client experience by preferring IPv4 in cases that might be
   identified as failures.

   Still, the expectation of operators is that PMTUD should work and
   that unnecessary breakage of client traffic should be avoided.

   A final observation regarding server tuning is that it is not always
   possible, even if it is potentially desirable, to independently set
   the TCP MSS for different address families on some end systems.  On
   Linux platforms, advmss may be set on a per-route basis for selected
   destinations in cases where discrimination by route is possible.

   The problem as described does also impact IPv4; however, the
   implementation of RFC 4821 [RFC4821] TCP MTU probing, the ability to
   fragment on the wire at tunnel ingress points, and the relative
   rarity of sub-1500-byte MTUs that are not coupled to changes in
   client behavior (for example, endpoint VPN clients set the tunnel
   interface MTU accordingly to avoid fragmentation for performance
   reasons) make the problem sufficiently rare that some existing
   deployments have chosen to ignore it.

3.  Mitigation

   Mitigation of the potential for PTB messages to be mis-delivered
   involves ensuring that an ICMPv6 error message is distributed to the
   same anycast server responsible for the flow for which the error is
   generated.  With appropriate hardware support, mitigation could be
   done by the mechanism hosts use to identify the flow: by looking
   into the payload of the ICMPv6 message (to determine which TCP flow
   it was associated with) before making a forwarding decision.
   Because the encapsulated IP header occurs at a fixed offset in the
   ICMP message, it is not outside the realm of possibility that
   routers with sufficient header processing capability could parse
   that far into the payload, as the sketch below illustrates.
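
   The following Python/scapy sketch is purely illustrative and is not
   drawn from any existing router implementation; the hash function,
   next-hop names, and addresses are placeholders.  It recovers the
   invoking packet embedded in the PTB (which begins at a fixed offset,
   after the type/code/checksum and 4-byte MTU fields of the ICMPv6
   header) and hashes the embedded flow in the reverse direction, so
   that the PTB selects the same next hop as the inbound client-to-
   server packets of the flow.

   #!/usr/bin/python
   # Illustrative sketch only: hash an ICMPv6 PTB on the flow embedded
   # in its payload rather than on its outer headers.  zlib.crc32 and
   # the next-hop names are stand-ins for real forwarding behavior.

   import zlib
   from scapy.all import IPv6, TCP, ICMPv6PacketTooBig

   NEXTHOPS = ["lb-1", "lb-2", "lb-3", "lb-4"]

   def tuple_hash(src, dst, proto, sport, dport):
       key = "%s|%s|%d|%d|%d" % (src, dst, proto, sport, dport)
       return NEXTHOPS[zlib.crc32(key.encode()) % len(NEXTHOPS)]

   def nexthop_for_ptb(pkt):
       if ICMPv6PacketTooBig not in pkt:
           return None
       # The invoking packet sits at a fixed offset in the ICMPv6
       # message, after the type/code/checksum and 4-byte MTU fields.
       inner = pkt[ICMPv6PacketTooBig].payload
       if IPv6 not in inner or TCP not in inner:
           return None
       # The embedded packet is the (too large) server-to-client
       # packet; reverse it to match the inward, client-to-server
       # direction that the ECMP hash normally sees.  Protocol 6 (TCP)
       # is used directly because only TCP is handled here.
       return tuple_hash(inner[IPv6].dst, inner[IPv6].src, 6,
                         inner[TCP].dport, inner[TCP].sport)

   if __name__ == '__main__':
       # Synthetic PTB for demonstration: the embedded packet was sent
       # from the anycast service address toward a client.
       too_big = (IPv6(src="2001:db8:100::80", dst="2001:db8:1::10") /
                  TCP(sport=443, dport=53171))
       ptb = (IPv6(src="2001:db8:2::1", dst="2001:db8:100::80") /
              ICMPv6PacketTooBig(mtu=1400) / too_big)
       print(nexthop_for_ptb(ptb))

   With the tuple reversed in this way, the value computed for the PTB
   matches the value computed for the client's inbound packets, which
   is the property Section 4 describes as reordering the elements of
   the hash.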

   Employing a mediation device that handles the parsing and
   distribution of PTB messages after policy routing, or on each load
   balancer/server, is another possibility.

   Another mitigation approach is predicated upon distributing the PTB
   message to all anycast servers under the assumption that the one for
   which the message was intended will be able to match it to the flow
   and update the route cache with the new MTU, and that devices not
   able to match the flow will discard these packets.  Such
   distribution has potentially significant implications for resource
   consumption and for self-inflicted denial of service if not
   carefully employed.  Fortunately, in real-world deployments we have
   observed that the number of flows for which this problem occurs is
   relatively small (for example, 10 or fewer packets per second on
   1 Gb/s or more of HTTPS traffic in a real-world deployment);
   sensible ingress rate limiters that discard excessive message volume
   can be applied to protect even very large anycast server tiers, with
   the potential for fallout limited to circumstances of deliberate
   duress.

3.1.  Alternative Mitigations

   As an alternative, it may be appropriate to lower the TCP MSS to
   1220 in order to accommodate a 1280-byte MTU.  We consider this
   undesirable because hosts may not be able to independently set the
   TCP MSS by address family, thereby impacting IPv4, or alternatively
   because middleboxes would need to be employed to clamp the MSS
   independently of the end systems.  Potentially, extension headers
   might further alter the lower bound that the MSS would have to be
   set to, making clamping still more undesirable.
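
   For completeness, the sketch below shows one way a host could apply
   such a clamp itself when the application can be modified.  It is an
   illustrative example rather than part of the implementation
   described in this document (the port number is arbitrary), and it
   assumes a Linux end system where the TCP_MAXSEG socket option is
   available; note that on a dual-stack socket the clamp would affect
   IPv4 connections as well, which is part of the drawback described
   above.

   #!/usr/bin/python
   # Illustrative only: clamp the MSS used on an IPv6 listening socket
   # so that segments fit within a 1280-byte path MTU.
   # 1220 = 1280 - 40 (IPv6 header) - 20 (TCP header).

   import socket

   MSS_FOR_1280_MTU = 1220

   s = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
   s.setsockopt(socket.IPPROTO_TCP, socket.TCP_MAXSEG, MSS_FOR_1280_MTU)
   s.bind(("::", 8443))
   s.listen(5)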

3.2.  Implementation

   1.  Filter-based forwarding matches next-header ICMPv6 type 2 and
       matches a next hop on a particular subnet directly attached to
       one or more routers.  The filter is policed to reasonable limits
       (we chose 1000 pps; more conservative rates might be required in
       other implementations).

   2.  The filter is applied on the input side of all external
       (Internet- or customer-facing) interfaces.

   3.  A proxy located at the next hop forwards ICMPv6 type 2 packets
       received at the next hop to an Ethernet broadcast address
       (for example, ff:ff:ff:ff:ff:ff) on all specified subnets.  This
       was necessitated by router inability (in IPv6) to forward the
       same packet to multiple unicast next hops.

   4.  Anycasted servers receive the PTB error and process the packet
       as needed.

   A simple Python scapy script that can perform the ICMPv6 proxy
   reflection is included.

   #!/usr/bin/python

   from scapy.all import *

   IFACE_OUT = ["p2p1", "p2p2"]

   def icmp6_callback(pkt):
       # Reflect PTB messages that have not already been broadcast.
       if pkt.haslayer(IPv6) and (ICMPv6PacketTooBig in pkt) \
               and pkt[Ether].dst != 'ff:ff:ff:ff:ff:ff':
           # Rewrite the Ethernet header to the broadcast address.
           del(pkt[Ether].src)
           pkt[Ether].dst = 'ff:ff:ff:ff:ff:ff'
           pkt.show()
           for iface in IFACE_OUT:
               sendp(pkt, iface=iface)

   def main():
       # Sniff ICMPv6 packets whose type (first byte after the fixed
       # IPv6 header) is 2, i.e., Packet Too Big.
       sniff(prn=icmp6_callback,
             filter="icmp6 and (ip6[40+0] == 2)", store=0)

   if __name__ == '__main__':
       main()

   This example script listens on all interfaces for IPv6 PTB errors
   being forwarded using filter-based forwarding.  It removes the
   existing Ethernet source and rewrites the Ethernet destination to
   the Ethernet broadcast address.  It then sends the resulting frame
   out the p2p1 and p2p2 interfaces, which are attached to the VLANs
   where our anycast servers reside.

3.2.1.  Alternative Implementation

   Alternatively, network designs in which a common layer 2 network
   exists on the ECMP hop could distribute the proxy onto the end
   systems, eliminating the need for policy routing.  They could then
   rewrite the destination -- for example, using iptables -- before
   forwarding the packet back to the network containing all of the
   server or load balancer interfaces.  This implementation can be done
   entirely within the Linux iptables firewall.  Because of the
   distributed nature of the filter, more conservative rate limits are
   required than when a global rate limit can be employed.

   An example ip6tables/nftables rule to match ICMPv6 type 2 traffic,
   exclude traffic already sent to the broadcast address, impose a rate
   limit of 10 pps, and copy matching packets to a target destination
   would resemble:

   ip6tables -I INPUT -i lo -p icmpv6 -m icmpv6 --icmpv6-type 2/0 \
   -m pkttype ! --pkt-type broadcast -m limit --limit 10/second \
   -j TEE --gateway 2001:DB8::1

   As with the scapy example, the copied traffic is reflected to all
   hosts on the subnet: the TEE gateway address -- here an address from
   the IPv6 documentation prefix -- is given a hardcoded neighbor
   discovery (ND) entry that maps it to the Ethernet broadcast address.

4.  Improvements

   There are several ways in which the ECMP load balancing of ICMPv6
   PTB messages could be improved.  Little in the way of Internet
   protocol specification change is required; rather, we foresee
   practical implementation changes, which insofar as we are aware do
   not exist in current routers, switches, or layer 3/4 load balancers.
   Alternatively, improved in-band detection of the path MTU by clients
   and servers could render the behavior of devices in the path
   irrelevant.

   1.  Routers with sufficient capacity within the lookup process could
       parse all the way through the L3 or L4 header in the ICMPv6
       payload, beginning at bit offset 32 of the ICMP header.  By
       reordering the elements of the hash to match the inward
       direction of the flow, the PTB error could be directed to the
       same next hop as the incoming packets in the flow.

   2.  The FIB (Forwarding Information Base) on the router could be
       programmed with a multicast distribution tree that included all
       of the necessary next hops, and unicast ICMPv6 packets could be
       policy routed to these destinations.

   3.  Ubiquitous implementation of RFC 4821 [RFC4821] Packetization
       Layer Path MTU Discovery would probably go a long way towards
       reducing dependence on ICMPv6 PTB by end systems (a minimal
       example of enabling this on a Linux end system follows this
       list).
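
   The following minimal sketch is illustrative only and not part of
   the implementation described above; it assumes a Linux end system,
   where RFC 4821 probing for TCP is controlled by the tcp_mtu_probing
   sysctl (the knob lives under "ipv4" but governs TCP over IPv6 as
   well).  Equivalent configuration is more commonly applied with
   sysctl at boot time.

   #!/usr/bin/python
   # Illustrative only: enable Packetization Layer Path MTU Discovery
   # (RFC 4821) for TCP on a Linux end system.
   # 0 = disabled, 1 = probe after a black hole is detected,
   # 2 = always probe.

   SYSCTL = "/proc/sys/net/ipv4/tcp_mtu_probing"

   def enable_mtu_probing(mode=1):
       with open(SYSCTL, "w") as f:
           f.write("%d\n" % mode)

   if __name__ == '__main__':
       enable_mtu_probing(1)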

5.  Acknowledgements

   The authors would like to thank Marek Majkowski for contributing
   text, examples, and a very close review.  The authors would also
   like to thank Mark Andrews, Brian Carpenter, Nick Hilliard, and Ray
   Hunter for their reviews.

6.  IANA Considerations

   This memo includes no request to IANA.

7.  Security Considerations

   The employed mitigation has the potential to greatly amplify the
   impact of deliberately malicious sending of ICMPv6 PTB messages.
   Sensible ingress rate limiting can reduce the potential for impact;
   however, legitimate PMTUD messages may be lost once the rate limit
   is reached, analogous to other cases where DoS traffic can crowd out
   legitimate traffic.

   The proxy replication results in devices on the subnet that are not
   associated with the flow that generated the PTB becoming recipients
   of the ICMPv6 PTB message, which contains a large fragment of the
   packet that exceeded the allowable MTU.  This replication of the
   packet fragment could arguably result in information disclosure.
   Recipient machines should therefore be in a common administrative
   domain.

8.  Informative References

   [RFC1981]  McCann, J., Deering, S., and J. Mogul, "Path MTU
              Discovery for IP version 6", RFC 1981,
              DOI 10.17487/RFC1981, August 1996,
              <http://www.rfc-editor.org/info/rfc1981>.

   [RFC2991]  Thaler, D. and C. Hopps, "Multipath Issues in Unicast and
              Multicast Next-Hop Selection", RFC 2991,
              DOI 10.17487/RFC2991, November 2000,
              <http://www.rfc-editor.org/info/rfc2991>.

   [RFC4821]  Mathis, M. and J. Heffner, "Packetization Layer Path MTU
              Discovery", RFC 4821, DOI 10.17487/RFC4821, March 2007,
              <http://www.rfc-editor.org/info/rfc4821>.

Authors' Addresses

   Matt Byerly
   Fastly
   Kapolei, HI
   US

   Email: suckawha@gmail.com


   Matt Hite
   Evernote
   Redwood City, CA
   US

   Email: mhite@hotmail.com


   Joel Jaeggli
   Fastly
   Mountain View, CA
   US

   Email: joelja@gmail.com