v6ops                                                          M. Byerly
Internet-Draft                                                    Fastly
Intended status: Informational                                   M. Hite
Expires: February 25, 2015                                      Evernote
                                                              J. Jaeggli
                                                                  Fastly
                                                         August 24, 2014


 Close encounters of the ICMP type 2 kind (near misses with ICMPv6 PTB)
                   draft-v6ops-pmtud-ecmp-problem-00

Abstract

   This document calls attention to the problem of delivering ICMPv6
   type 2 "Packet Too Big" (PTB) messages to intended destinations in
   ECMP load balanced, anycast network architectures.
   It discusses operational mitigations that can address this class of
   failure.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on February 25, 2015.

Copyright Notice

   Copyright (c) 2014 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Problem
   3.  Mitigation
     3.1.  Alternatives
     3.2.  Implementation
   4.  Improvements
   5.  Acknowledgements
   6.  IANA Considerations
   7.  Security Considerations
   8.  Informative References
   Authors' Addresses

1.  Introduction

   Operators of popular Internet services face unique challenges
   associated with scaling their infrastructure.  One approach is to
   utilize equal-cost multi-path (ECMP) routing to perform stateless
   distribution of incoming TCP or UDP sessions to multiple servers or
   middle boxes such as load balancers.  Distribution of traffic in this
   manner presents a problem when dealing with ICMP signaling.
   Specifically, an ICMP error is not guaranteed to hash via ECMP to the
   same destination as its corresponding TCP or UDP session.  A case
   where this is particularly problematic operationally is path MTU
   discovery (PMTUD).

2.  Problem

   A common application for stateless load balancing of TCP or UDP flows
   is to perform an initial subdivision of flows in front of a stateful
   load balancer tier or multiple servers so that the workload becomes
   divided into manageable fractions of the total number of flows.  The
   flow division is performed using ECMP forwarding and a stateless but
   sticky algorithm for hashing across the available paths.  This
   next-hop selection for the purposes of flow distribution is a
   constrained form of anycast where all anycast destinations are
   equidistant topologically from the upstream router responsible for
   making the last next-hop forwarding decision before the flow arrives
   on the destination device.  In this approach, the hash is performed
   across some set of available protocol headers.  Typically, these
   headers may include (IPv6) Flow-Label, IP-source, IP-destination,
   protocol, source-port, destination-port and potentially others such
   as ingress interface.
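
   The hash-based next-hop selection described above can be sketched as
   follows.  This is a minimal illustration, not any vendor's
   algorithm: the use of SHA-256, the next-hop names, and the addresses
   (taken from the 2001:db8::/32 documentation prefix) are all
   assumptions made for the example.  It shows that the same outer
   headers always select the same next hop, while an ICMPv6 message
   addressed to the same anycast destination presents a different set
   of outer headers to the hash.

```python
import hashlib
import struct
from ipaddress import IPv6Address

# Hypothetical next-hop set; the names are illustrative only.
NEXT_HOPS = ["lb1", "lb2", "lb3", "lb4"]


def ecmp_next_hop(src, dst, proto, sport=0, dport=0, next_hops=NEXT_HOPS):
    """Pick a next hop from a stateless hash of the outer packet headers.

    SHA-256 stands in for whatever hash a real router uses (assumption).
    """
    key = (IPv6Address(src).packed + IPv6Address(dst).packed
           + struct.pack("!BHH", proto, sport, dport))
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(next_hops)
    return next_hops[index]


if __name__ == "__main__":
    anycast = "2001:db8:ffff::1"
    # Client TCP flow toward the anycast service (protocol 6).
    tcp_hop = ecmp_next_hop("2001:db8:1::10", anycast, 6, 51515, 443)
    # PTB from an intermediate router: different source address,
    # protocol 58 (ICMPv6), and no ports in the outer header.
    ptb_hop = ecmp_next_hop("2001:db8:2::1", anycast, 58)
    print("TCP flow ->", tcp_hop)
    print("PTB      ->", ptb_hop)
```

   Because the two hash inputs are unrelated, the PTB lands on the
   flow's next hop only by chance; with N equally weighted next hops
   and a uniform hash that chance is roughly 1/N.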

   A problem common to this approach of distribution through hashing is
   its impact on path MTU discovery.  An ICMPv6 type 2 PTB message
   generated on an intermediate device for a packet sent from an ECMP
   load-balanced server to a client will have the load-balanced anycast
   address as the destination and will be statelessly load balanced to
   one of the anycast servers.  While the ICMPv6 PTB message contains as
   much of the packet that could not be forwarded as possible, the
   payload headers do not factor into the forwarding decision and are
   ignored.  Because the PTB message is not identifiable as part of the
   original flow by its packet header, the ICMPv6 ECMP hash is unlikely
   to select the same next-hop as the TCP or UDP ECMP hash for that
   flow.

   An example packet flow and topology follow.

   ptb -> router ecmp -> nexthop L4/L7 load balancer -> destination

   router --> load balancer 1 --->
        \\--> load balancer 2 ---> load-balanced service
         \--> load balancer N --->

                                Figure 1

   The router ECMP decision is used because it is part of the forwarding
   architecture, can be performed at line rate, and does not depend on
   shared state or coordination across a distributed forwarding
   architecture which may include multiple linecards or routers.  The
   ECMP routing decision is deterministic with respect to packets having
   the same computed hash.

   The typical case where ICMPv6 PTB messages are received at the load
   balancer is one where the path MTU from the client to the load
   balancer is limited by a tunnel of which the client itself is not
   aware.  In the common case of a TCP flow where TLS is employed, the
   first packet that is likely to exceed a tunnel MTU lower than that
   specified by the MSS on the client and the load balancer/server is
   the TLS ServerHello and certificate.
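
   The arithmetic behind this case can be worked through in a few
   lines.  The sketch below assumes a 40-byte IPv6 header, a 20-byte
   TCP header without options, and an illustrative tunnel MTU of 1480
   bytes (as would result from IPv6-in-IPv4 encapsulation over a
   1500-byte link); none of these figures are measurements from the
   deployment described here.

```python
IPV6_HEADER = 40  # fixed IPv6 header, in bytes
TCP_HEADER = 20   # TCP header without options, in bytes


def mss_for_mtu(mtu):
    """Largest TCP segment payload that fits in one packet of size mtu."""
    return mtu - IPV6_HEADER - TCP_HEADER


def triggers_ptb(mss, path_mtu):
    """True when a full-sized segment is larger than the path MTU."""
    return mss + IPV6_HEADER + TCP_HEADER > path_mtu


if __name__ == "__main__":
    TUNNEL_MTU = 1480  # assumed tunnel MTU, for illustration only
    mss = mss_for_mtu(1500)
    print(mss)                            # 1440
    print(triggers_ptb(mss, TUNNEL_MTU))  # True
    print(mss_for_mtu(1280))              # 1220
```

   Both ends negotiate an MSS of 1440 from their 1500-byte interfaces,
   so short handshake packets pass through the tunnel unnoticed and the
   first full-sized segment from the server is the one that draws a
   PTB.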

   Direct experience says that the frequency of PTB messages is small
   compared to total flows.  One possible conclusion is that tunneled
   IPv6 deployments that cannot carry 1500-byte packets are relatively
   rare.  Techniques employed by clients, such as happy-eyeballs, may
   actually contribute some amelioration to the IPv6 client experience
   by preferring IPv4 in cases that might be identified as slow or
   failed.  Still, the expectation of operators is that PMTUD should
   work and that unnecessary breakage of client traffic should be
   avoided.

   A final observation regarding server tuning is that it is typically
   not possible, even if it is potentially desirable, to independently
   set the TCP MSS for different address families on end-systems.

   The problem as described does also impact IPv4; however, the ability
   to fragment on the wire at tunnel ingress points and the relative
   rarity of sub-1500 byte MTUs that are not coupled to changes in
   client behavior (for example, endpoint VPN clients set the tunnel
   interface MTU accordingly for performance reasons) make the problem
   sufficiently rare that some deployments simply choose to ignore it.

3.  Mitigation

   Mitigation of the potential for PTB messages to be mis-delivered
   involves ensuring that an ICMPv6 error message is distributed to the
   same anycast server responsible for the flow for which the error is
   generated.  Ideally, mitigation could be done by the mechanism hosts
   use to identify the flow, by looking into the payload of the ICMPv6
   message (to determine which TCP flow it was associated with) before
   making a forwarding decision.  Because the encapsulated IP header
   occurs at a fixed offset in the ICMP message, it is not outside the
   realm of possibility that routers with sufficient header processing
   capability could parse that far into the payload.
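
   A minimal sketch of such payload inspection follows.  It assumes the
   embedded packet carries TCP or UDP directly after the IPv6 header
   (no extension headers) and that the ICMPv6 checksum is validated
   elsewhere.  The helper build_ptb() is hypothetical and exists only
   to fabricate a test message; the addresses come from the
   documentation prefix.

```python
import struct
from ipaddress import IPv6Address

ICMP6_PTB_HDR = 8  # type, code, checksum, MTU
IPV6_HDR = 40      # fixed IPv6 header


def flow_key_from_ptb(icmp6):
    """Return the client-to-server flow key embedded in a PTB message.

    icmp6 is the ICMPv6 message starting at its type field.  Returns
    None for anything other than a PTB quoting plain TCP or UDP.
    """
    if len(icmp6) < ICMP6_PTB_HDR + IPV6_HDR + 4 or icmp6[0] != 2:
        return None
    inner = icmp6[ICMP6_PTB_HDR:]
    next_header = inner[6]
    if next_header not in (6, 17):  # extension headers not handled
        return None
    server = IPv6Address(inner[8:24])   # inner source: anycast server
    client = IPv6Address(inner[24:40])  # inner destination: client
    sport, dport = struct.unpack("!HH", inner[40:44])
    # Swap the fields so the key matches the inbound direction.
    return (str(client), str(server), next_header, dport, sport)


def build_ptb(mtu, server, client, sport, dport):
    """Hypothetical helper: fabricate a PTB quoting a TCP segment."""
    inner = struct.pack("!IHBB", 6 << 28, 20, 6, 64)
    inner += IPv6Address(server).packed + IPv6Address(client).packed
    inner += struct.pack("!HH", sport, dport) + bytes(16)
    return struct.pack("!BBHI", 2, 0, 0, mtu) + inner  # checksum left 0


if __name__ == "__main__":
    ptb = build_ptb(1280, "2001:db8:ffff::1", "2001:db8:1::10",
                    443, 51515)
    print(flow_key_from_ptb(ptb))
```

   The returned tuple has the same shape as the headers of packets
   arriving from the client, so a device that hashes it in place of the
   outer ICMPv6 header would steer the PTB along the path taken by the
   flow itself.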
   Employing a mediation device that handles the parsing and
   distribution of PTB messages after policy routing or on each
   load-balancer/server is a possibility.

   Another mitigation approach is predicated upon distributing the PTB
   message to all anycast servers under the assumption that the one for
   which the message was intended will be able to match it to the flow
   and update the route cache with the new MTU, while devices not able
   to match the flow will discard these packets.  Such distribution has
   potentially significant implications for resource consumption and the
   potential for self-inflicted denial-of-service if not carefully
   employed.  Fortunately, in real-world deployments we have observed
   that the number of flows for which this problem occurs is relatively
   small (for example, 10 or fewer pps on 1Gb/s or more worth of https
   traffic), and sensible ingress rate limiters which will discard
   excessive message volume can be applied to protect even very large
   anycast server tiers, with the potential for fallout only under
   circumstances of deliberate duress.

3.1.  Alternatives

   As an alternative, it may be appropriate to lower the TCP MSS to 1220
   in order to accommodate a 1280 byte MTU.  We consider this
   undesirable because hosts may not be able to independently set the
   TCP MSS by address-family, thereby impacting IPv4, or alternatively
   because it relies on a middle-box to clamp the MSS independently from
   the end-systems.

3.2.  Implementation

   1.  Filter-based-forwarding matches next-header ICMPv6 type-2 and
       directs those packets to a next-hop on a particular subnet
       directly attached to both border routers.  The filter is policed
       to reasonable limits (we chose 1000pps).

   2.  The filter is applied on the input side of all external
       interfaces.

   3.  A proxy located at the next-hop forwards ICMPv6 type-2 packets
       received at the next-hop to an Ethernet broadcast address (for
       example, ff:ff:ff:ff:ff:ff) on all specified subnets.  This was
       necessitated by router inability (in IPv6) to forward the same
       packet to multiple unicast next-hops.

   4.  Anycast servers receive the PTB error and process the packet as
       needed.

   A simple Python scapy script that can perform the ICMPv6 proxy
   reflection is included.

   #!/usr/bin/python

   from scapy.all import Ether, IPv6, ICMPv6PacketTooBig, sendp, sniff

   # Interfaces facing the anycast servers.
   IFACE_OUT = ["p2p1", "p2p2"]
   BROADCAST = 'ff:ff:ff:ff:ff:ff'

   def icmp6_callback(pkt):
       # Skip frames that are already broadcast so that our own
       # reflected copies are not reflected again.
       if pkt.haslayer(IPv6) and (ICMPv6PacketTooBig in pkt) \
               and pkt[Ether].dst != BROADCAST:
           # Clear the source MAC so scapy fills it in on send.
           del pkt[Ether].src
           pkt[Ether].dst = BROADCAST
           pkt.show()
           for iface in IFACE_OUT:
               sendp(pkt, iface=iface)

   def main():
       # ip6[40] is the ICMPv6 type field; 2 is Packet Too Big.
       sniff(prn=icmp6_callback,
             filter="icmp6 and (ip6[40+0] == 2)", store=0)

   if __name__ == '__main__':
       main()

   This example script listens on all interfaces for IPv6 PTB errors
   being forwarded using filter-based-forwarding.  It clears the
   existing Ethernet source address and rewrites the Ethernet
   destination to the Ethernet broadcast address.  It then sends the
   resulting frame out the p2p1 and p2p2 interfaces where our anycast
   servers reside.

4.  Improvements

   There are several ways that improvements could be made to the
   situation with respect to ECMP load balancing of ICMPv6 PTB.

   1.  Routers with sufficient capacity within the lookup process could
       parse all the way through the L3 or L4 header in the ICMPv6
       payload beginning at bit offset 32 of the ICMP header.  By
       reordering the elements of the hash to match the inward direction
       of the flow, the PTB error could be directed to the same next-hop
       as the incoming packets in the flow.

   2.  The FIB could be programmed with a multicast distribution tree
       that included all of the necessary next-hops.

   3.  Ubiquitous implementation of RFC 4821 [RFC4821] Packetization
       Layer Path MTU Discovery would probably go a long way towards
       reducing dependence on ICMPv6 PTB.

5.  Acknowledgements

   The authors would like to thank Mark Andrews, Brian Carpenter, Nick
   Hilliard and Ray Hunter for review.

6.  IANA Considerations

   This memo includes no request to IANA.

7.  Security Considerations

   The employed mitigation has the potential to greatly amplify the
   impact of a deliberately malicious sending of ICMPv6 PTB messages.
   Sensible ingress rate limiting can reduce the potential for impact;
   however, legitimate traffic may be lost once the rate limit is
   reached.

   The proxy replication results in devices not associated with the flow
   that generated the PTB being recipients of an ICMPv6 message which
   contains a fragment of a packet.  This could arguably result in
   information disclosure.  Recipient machines should be in a common
   administrative domain.

8.  Informative References

   [RFC4821]  Mathis, M. and J. Heffner, "Packetization Layer Path MTU
              Discovery", RFC 4821, March 2007.

Authors' Addresses

   Matt Byerly
   Fastly
   Kapolei, HI
   US

   Email: mbyerly@zynga.com


   Matt Hite
   Evernote
   Redwood City, CA
   US

   Email: mhite@hotmail.com


   Joel Jaeggli
   Fastly
   Mountain View, CA
   US

   Email: joelja@gmail.com