v6ops                                                          M. Byerly
Internet-Draft                                                    Fastly
Intended status: Informational                                   M. Hite
Expires: December 30, 2015                                      Evernote
                                                              J. Jaeggli
                                                                  Fastly
                                                           June 28, 2015

 Close encounters of the ICMP type 2 kind (near misses with ICMPv6 PTB)
                 draft-ietf-v6ops-pmtud-ecmp-problem-03

Abstract

   This document calls attention to the problem of delivering ICMPv6
   type 2 "Packet Too Big" (PTB) messages to the intended destination in
   ECMP load balanced or anycast network architectures.
   It discusses operational mitigations that can be employed to address
   this class of failures.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on December 30, 2015.

Copyright Notice

   Copyright (c) 2015 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Problem
   3.  Mitigation
     3.1.  Alternatives
     3.2.  Implementation
       3.2.1.  Alternatives
   4.  Improvements
   5.  Acknowledgements
   6.  IANA Considerations
   7.  Security Considerations
   8.  Informative References
   Authors' Addresses

1.  Introduction

   Operators of popular Internet services face complex challenges
   associated with scaling their infrastructure.  One approach is to
   utilize equal-cost multi-path (ECMP) routing to perform stateless
   distribution of incoming TCP or UDP sessions to multiple servers or
   to middle boxes such as load balancers.  Distribution of traffic in
   this manner presents a problem when dealing with ICMP signaling.
   Specifically, an ICMP error is not guaranteed to hash via ECMP to the
   same destination as its corresponding TCP or UDP session.  A case
   where this is particularly problematic operationally is path MTU
   discovery (PMTUD).

2.  Problem

   A common application for stateless load balancing of TCP or UDP flows
   is to perform an initial subdivision of flows in front of a stateful
   load balancer tier or multiple servers so that the workload becomes
   divided into manageable fractions of the total number of flows.  The
   flow division is performed using ECMP forwarding and a stateless but
   sticky algorithm for hashing across the available paths.  This
   nexthop selection for the purposes of flow distribution is a
   constrained form of anycast topology, where all anycast destinations
   are equidistant from the upstream router responsible for making the
   last next-hop forwarding decision before the flow arrives on the
   destination device.  In this approach, the hash is performed across
   some set of available protocol headers.
   Typically, these headers may include all or a subset of (IPv6)
   Flow-Label, IP-source, IP-destination, protocol, source-port,
   destination-port, and potentially others such as ingress interface.

   A problem common to this approach of distribution through hashing is
   its impact on path MTU discovery.  An ICMPv6 type 2 PTB message
   generated on an intermediate device for a packet sent from a server
   that is part of an ECMP load-balanced service to a client will have
   the load-balanced anycast address as the destination and hence will
   be statelessly load balanced to one of the servers.  While the ICMPv6
   PTB message contains as much of the packet that could not be
   forwarded as possible, the payload headers are not considered in the
   forwarding decision and are ignored.  Because the PTB message is not
   identifiable as part of the original flow by the IP or upper-layer
   packet headers, the ECMP hash computed for the ICMPv6 packet is
   unlikely to select the same next hop as the ECMP hash computed for
   packets of the TCP or UDP flow.

   An example packet flow and topology follow.

   ptb -> router ecmp -> nexthop L4/L7 load balancer -> destination

   router --> load balancer 1 --->
      \\--> load balancer 2 ---> load-balanced service
       \--> load balancer N --->

                                 Figure 1

   The router ECMP decision is used because it is part of the forwarding
   architecture, can be performed at line rate, and does not depend on
   shared state or coordination across a distributed forwarding system
   which may include multiple linecards or routers.  The ECMP routing
   decision is deterministic with respect to packets having the same
   computed hash.

   A typical case where ICMPv6 PTB messages are received at the load
   balancer is one where the path MTU from the client to the load
   balancer is limited by a tunnel of which the client itself is not
   aware.

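
   The mismatch described in this section can be illustrated with a
   short sketch.  Everything in it is an assumption made for
   illustration only: the addresses and ports, the FlowKey tuple, and
   the use of CRC32 modulo the number of paths as a stand-in for a
   router's ECMP hash (real routers use vendor-specific hash functions
   and field selections).

```python
import zlib
from typing import NamedTuple


class FlowKey(NamedTuple):
    """Hypothetical 5-tuple a router might feed into its ECMP hash."""
    src: str
    dst: str
    next_header: int  # 6 = TCP, 58 = ICMPv6
    sport: int
    dport: int


def ecmp_next_hop(key: FlowKey, n_paths: int) -> int:
    """Stand-in ECMP hash: CRC32 of the key fields modulo path count."""
    data = "|".join(str(field) for field in key).encode()
    return zlib.crc32(data) % n_paths


def reverse(key: FlowKey) -> FlowKey:
    """Swap source and destination to describe the opposite direction."""
    return FlowKey(key.dst, key.src, key.next_header, key.dport, key.sport)


N_PATHS = 8  # assumed number of load balancers behind the router

# Client-to-service packets of one TCP flow, as hashed by the router.
inbound = FlowKey("2001:db8:c::1", "2001:db8:a::80", 6, 51000, 443)

# Outer header of the PTB: sent by an intermediate router to the anycast
# address. ICMPv6 has no ports, so the port fields are assumed to be 0.
ptb_outer = FlowKey("2001:db8:f::1", "2001:db8:a::80", 58, 0, 0)

# Header of the oversized server-to-client packet carried inside the PTB.
ptb_embedded = reverse(inbound)

flow_hop = ecmp_next_hop(inbound, N_PATHS)
ptb_hop = ecmp_next_hop(ptb_outer, N_PATHS)
print("flow next hop:", flow_hop)
print("PTB next hop: ", ptb_hop)
print("embedded header identifies the flow:",
      reverse(ptb_embedded) == inbound)
```

   Because the outer key shares only the destination address with the
   flow key, the two next hops coincide only by chance (roughly 1 in
   N_PATHS for a uniform hash), while the embedded header, which the
   router ignores, would identify the flow exactly.
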
   Direct experience says that the frequency of PTB messages is small
   compared to total flows.  One possible conclusion is that tunneled
   IPv6 deployments that cannot carry 1500-byte packets are relatively
   rare.  Techniques employed by clients such as happy-eyeballs may
   actually contribute some amelioration to the IPv6 client experience
   by preferring IPv4 in cases that might be identified as failures.

   Still, the expectation of operators is that PMTUD should work and
   that unnecessary breakage of client traffic should be avoided.

   A final observation regarding server tuning is that it is not always
   possible, even if it is potentially desirable, to independently set
   the TCP MSS for different address families on some end-systems.  On
   Linux platforms, advmss may be set on a per-route basis for selected
   destinations in cases where discrimination by route is possible.

   The problem as described does also impact IPv4; however,
   implementation of RFC 4821 [RFC4821] TCP MTU probing, the ability to
   fragment on wire at tunnel ingress points, and the relative rarity of
   sub-1500 byte MTUs that are not coupled to changes in client behavior
   (for example, endpoint VPN clients set the tunnel interface MTU
   accordingly to avoid fragmentation for performance reasons) make the
   problem sufficiently rare that some existing deployments have chosen
   to ignore it.

3.  Mitigation

   Mitigation of the potential for PTB messages to be mis-delivered
   involves ensuring that an ICMPv6 error message is distributed to the
   same anycast server responsible for the flow for which the error is
   generated.  Ideally, mitigation could be done by the mechanism hosts
   use to identify the flow, by looking into the payload of the ICMPv6
   message (to determine which TCP flow it was associated with) before
   making a forwarding decision.
   Because the encapsulated IP header occurs at a fixed offset in the
   ICMP message, it is not outside the realm of possibility that routers
   with sufficient header processing capability could parse that far
   into the payload.  Employing a mediation device that handles the
   parsing and distribution of PTB messages after policy routing or on
   each load-balancer/server is a possibility.

   Another mitigation approach is predicated upon distributing the PTB
   message to all anycast servers under the assumption that the one for
   which the message was intended will be able to match it to the flow
   and update the route cache with the new MTU and that devices not able
   to match the flow will discard these packets.  Such distribution has
   potentially significant implications for resource consumption and for
   self-inflicted denial-of-service if not carefully employed.
   Fortunately, in real-world deployments we have observed that the
   number of flows for which this problem occurs is relatively small
   (for example, 10 or fewer pps on 1 Gb/s or more worth of HTTPS
   traffic in a real-world deployment); sensible ingress rate limiters
   which will discard excessive message volume can be applied to protect
   even very large anycast server tiers with the potential for fallout
   limited to circumstances of deliberate duress.

3.1.  Alternatives

   As an alternative, it may be appropriate to lower the TCP MSS to 1220
   in order to accommodate a 1280 byte MTU.  We consider this
   undesirable, as hosts may not be able to independently set TCP MSS by
   address-family, thereby impacting IPv4, or alternatively middle-boxes
   need to be employed to clamp the MSS independently from the
   end-systems.  Potentially, extension headers might further alter the
   lower bound that the MSS would have to be set to, making clamping
   still more undesirable.

3.2.  Implementation

   1.  Filter-based-forwarding matches next-header ICMPv6 type-2 and
       matches a next-hop on a particular subnet directly attached to
       both border routers.  The filter is policed to reasonable limits
       (we chose 1000 pps; more conservative rates might be required in
       other implementations).

   2.  The filter is applied on the input side of all external
       interfaces.

   3.  A proxy located at the next-hop forwards ICMPv6 type-2 packets
       received at the next-hop to an Ethernet broadcast address
       (for example, ff:ff:ff:ff:ff:ff) on all specified subnets.  This
       was necessitated by router inability (in IPv6) to forward the
       same packet to multiple unicast next-hops.

   4.  Anycast servers receive the PTB error and process the packet as
       needed.

   A simple Python scapy script that can perform the ICMPv6 proxy
   reflection is included.

   #!/usr/bin/python

   from scapy.all import Ether, IPv6, ICMPv6PacketTooBig, sendp, sniff

   # Interfaces attached to the VLANs where the anycast servers reside.
   IFACE_OUT = ["p2p1", "p2p2"]
   BROADCAST = "ff:ff:ff:ff:ff:ff"

   def icmp6_callback(pkt):
       # Reflect only ICMPv6 PTB frames that are not already broadcast,
       # so that reflected copies are not picked up and sent again.
       if Ether in pkt and IPv6 in pkt and ICMPv6PacketTooBig in pkt \
               and pkt[Ether].dst != BROADCAST:
           del pkt[Ether].src  # let scapy fill in the source MAC
           pkt[Ether].dst = BROADCAST
           pkt.show()
           for iface in IFACE_OUT:
               sendp(pkt, iface=iface)

   def main():
       # BPF filter: ICMPv6 whose type byte (the first byte after the
       # 40-byte fixed IPv6 header) is 2, that is, Packet Too Big.
       sniff(prn=icmp6_callback,
             filter="icmp6 and (ip6[40+0] == 2)", store=0)

   if __name__ == '__main__':
       main()

   This example script listens on all interfaces for IPv6 PTB errors
   being forwarded using filter-based-forwarding.  It removes the
   existing Ethernet source address and rewrites the Ethernet
   destination to the Ethernet broadcast address.  It then sends the
   resulting frame out the p2p1 and p2p2 interfaces, which are attached
   to the VLANs where our anycast servers reside.

3.2.1.  Alternatives

   Alternatively, network designs in which a common layer 2 network
   exists on the ECMP hop could distribute the proxy onto the end
   systems, eliminating the need for policy routing.
   They could then rewrite the destination (for example, using
   iptables) before forwarding the packet back to the network
   containing all of the server or load balancer interfaces.  This
   implementation can be done entirely within the Linux iptables
   firewall.  Because of the distributed nature of the filter, more
   conservative rate limits are required than when a global rate limit
   can be employed.

   An example ip6tables/nftables rule to match icmp6 traffic, not match
   broadcast traffic, impose a rate limit of 10 pps, and pass to a
   target destination would resemble:

   ip6tables -I INPUT -i lo -p icmpv6 -m icmpv6 --icmpv6-type 2/0 \
     -m pkttype ! --pkt-type broadcast -m limit --limit 10/second \
     -j TEE --gateway 2001:DB8::1

   As with the scapy example, once the destination has been rewritten
   from a hardcoded ND entry to an Ethernet broadcast address (in this
   case, to an IPv6 documentation address), the traffic will be
   reflected to all the hosts on the subnet.

4.  Improvements

   There are several ways that improvements could be made to the
   situation with respect to ECMP load balancing of ICMPv6 PTB.

   1.  Routers with sufficient capacity within the lookup process could
       parse all the way through the L3 or L4 header in the ICMPv6
       payload beginning at bit offset 32 of the ICMP header.  By
       reordering the elements of the hash to match the inward direction
       of the flow, the PTB error could be directed to the same next-hop
       as the incoming packets in the flow.

   2.  The FIB (Forwarding Information Base) on the router could be
       programmed with a multicast distribution tree that included all
       of the necessary next-hops, and ICMPv6 packets could be policy
       routed to this destination.

   3.  Ubiquitous implementation of RFC 4821 [RFC4821] Packetization
       Layer Path MTU Discovery would probably go a long way towards
       reducing dependence on ICMPv6 PTB by end systems.

5.  Acknowledgements

   The authors would like to thank Marek Majkowski for contributing
   text, examples, and a very close review.  The authors would like to
   thank Mark Andrews, Brian Carpenter, Nick Hilliard, and Ray Hunter
   for review.

6.  IANA Considerations

   This memo includes no request to IANA.

7.  Security Considerations

   The employed mitigation has the potential to greatly amplify the
   impact of a deliberately malicious sending of ICMPv6 PTB messages.
   Sensible ingress rate limiting can reduce the potential for impact;
   however, legitimate traffic may be lost once the rate limit is
   reached.

   The proxy replication results in devices not associated with the flow
   that generated the PTB being recipients of an ICMPv6 message which
   contains a fragment of a packet.  This could arguably result in
   information disclosure.  Recipient machines should be in a common
   administrative domain.

8.  Informative References

   [RFC4821]  Mathis, M. and J. Heffner, "Packetization Layer Path MTU
              Discovery", RFC 4821, March 2007.

Authors' Addresses

   Matt Byerly
   Fastly
   Kapolei, HI
   US

   Email: suckawha@gmail.com

   Matt Hite
   Evernote
   Redwood City, CA
   US

   Email: mhite@hotmail.com

   Joel Jaeggli
   Fastly
   Mountain View, CA
   US

   Email: joelja@gmail.com