idnits 2.17.1 

draft-ietf-v6ops-pmtud-ecmp-problem-03.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (June 28, 2015) is 3226 days in the past.  Is this
     intentional?

  -- Found something which looks like a code comment -- if you have code
     sections in the document, please surround them with '<CODE BEGINS>' and
     '<CODE ENDS>' lines.


  Checking references for intended status: Informational
  ----------------------------------------------------------------------------

  == Missing Reference: 'Ether' is mentioned on line 235, but not defined


     Summary: 0 errors (**), 0 flaws (~~), 2 warnings (==), 2 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.
--------------------------------------------------------------------------------


2	v6ops                                                          M. Byerly
3	Internet-Draft                                                    Fastly
4	Intended status: Informational                                   M. Hite
5	Expires: December 30, 2015                                      Evernote
6	                                                              J. Jaeggli
7	                                                                  Fastly
8	                                                           June 28, 2015

10	 Close encounters of the ICMP type 2 kind (near misses with ICMPv6 PTB)
11	                 draft-ietf-v6ops-pmtud-ecmp-problem-03

13	Abstract

15	   This document calls attention to the problem of delivering ICMPv6
16	   type 2 "Packet Too Big" (PTB) messages to the intended destination in
17	   ECMP load balanced or anycast network architectures.  It discusses
18	   operational mitigations that can be employed to address this class of
19	   failures.

21	Status of This Memo

23	   This Internet-Draft is submitted in full conformance with the
24	   provisions of BCP 78 and BCP 79.

26	   Internet-Drafts are working documents of the Internet Engineering
27	   Task Force (IETF).  Note that other groups may also distribute
28	   working documents as Internet-Drafts.  The list of current Internet-
29	   Drafts is at http://datatracker.ietf.org/drafts/current/.

31	   Internet-Drafts are draft documents valid for a maximum of six months
32	   and may be updated, replaced, or obsoleted by other documents at any
33	   time.  It is inappropriate to use Internet-Drafts as reference
34	   material or to cite them other than as "work in progress."

36	   This Internet-Draft will expire on December 30, 2015.

38	Copyright Notice

40	   Copyright (c) 2015 IETF Trust and the persons identified as the
41	   document authors.  All rights reserved.

43	   This document is subject to BCP 78 and the IETF Trust's Legal
44	   Provisions Relating to IETF Documents
45	   (http://trustee.ietf.org/license-info) in effect on the date of
46	   publication of this document.  Please review these documents
47	   carefully, as they describe your rights and restrictions with respect
48	   to this document.  Code Components extracted from this document must
49	   include Simplified BSD License text as described in Section 4.e of
50	   the Trust Legal Provisions and are provided without warranty as
51	   described in the Simplified BSD License.

53	Table of Contents

55	   1.  Introduction
56	   2.  Problem
57	   3.  Mitigation
58	     3.1.  Alternatives
59	     3.2.  Implementation
60	       3.2.1.  Alternatives
61	   4.  Improvements
62	   5.  Acknowledgements
63	   6.  IANA Considerations
64	   7.  Security Considerations
65	   8.  Informative References
66	   Authors' Addresses

68	1.  Introduction

70	   Operators of popular Internet services face complex challenges
71	   associated with scaling their infrastructure.  One approach is to
72	   utilize equal-cost multi-path (ECMP) routing to perform stateless
73	   distribution of incoming TCP or UDP sessions to multiple servers or
74	   to middle boxes such as load balancers.  Distribution of traffic in
75	   this manner presents a problem when dealing with ICMP signaling.
76	   Specifically, an ICMP error is not guaranteed to hash via ECMP to the
77	   same destination as its corresponding TCP or UDP session.  A case
78	   where this is particularly problematic operationally is path MTU
79	   discovery (PMTUD).

81	2.  Problem

83	   A common application for stateless load balancing of TCP or UDP flows
84	   is to perform an initial subdivision of flows in front of a stateful
85	   load balancer tier or multiple servers so that the workload becomes
86	   divided into manageable fractions of the total number of flows.  The
87	   flow division is performed using ECMP forwarding and a stateless but
88	   sticky algorithm for hashing across the available paths.  This
89	   nexthop selection for the purposes of flow distribution is a
90	   constrained form of anycast topology, where all anycast destinations
91	   are equidistant from the upstream router responsible for making the
92	   last next-hop forwarding decision before the flow arrives on the
93	   destination device.  In this approach, the hash is performed across
94	   some set of available protocol headers.  Typically, these headers may
95	   include all or a subset of (IPv6) Flow-Label, IP-source, IP-
96	   destination, protocol, source-port, destination-port and potentially
97	   others such as ingress interface.

99	   A problem common to this approach of distribution through hashing is
100	   impact on path MTU discovery.  An ICMPv6 type 2 PTB message generated
101	   on an intermediate device for a packet sent from a server that is
102	   part of an ECMP load balanced service to a client will have the load
103	   balanced anycast address as the destination and hence will be
104	   statelessly load balanced to one of the servers.  While the ICMPv6
105	   PTB message contains as much of the packet that could not be
106	   forwarded as possible, the payload headers are not considered in the
107	   forwarding decision and are ignored.  Because the PTB message is not
108	   identifiable as part of the original flow by the IP or upper layer
109	   packet headers, the ICMPv6 ECMP hash calculation is unlikely to
110	   select the same nexthop as packets matching the TCP or UDP ECMP hash
111	   of the flow.

113	   An example packet flow and topology follow.

115	   ptb -> router ecmp -> nexthop L4/L7 load balancer -> destination

117	     router --> load balancer 1 --->
118	          \\--> load balancer 2 ---> load-balanced service
119	           \--> load balancer N --->

121	                                 Figure 1

123	   The router ECMP decision is used because it is part of the forwarding
124	   architecture, can be performed at line rate, and does not depend on
125	   shared state or coordination across a distributed forwarding system
126	   which may include multiple linecards or routers.  The ECMP routing
127	   decision is deterministic with respect to packets having the same
128	   computed hash.
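
   The mismatch described above can be illustrated with a minimal
   sketch.  The hash function (SHA-256 over a 5-tuple), the addresses,
   the port numbers and the path count are all assumptions chosen for
   illustration; real routers use vendor-specific hash functions and
   field selections.  The point is only that the PTB message carries
   different outer header fields than the flow it refers to, so a
   deterministic hash has no reason to select the same nexthop.

```python
import hashlib

def ecmp_next_hop(src, dst, proto, sport, dport, n_paths):
    """Pick a next-hop index from a hash of the outer header fields.

    Illustrative stand-in for a router's ECMP hash, not a real one.
    """
    key = f"{src}|{dst}|{proto}|{sport}|{dport}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % n_paths

N_PATHS = 64                      # hypothetical number of load balancers
CLIENT = "2001:db8:1::10"         # hypothetical client address
SERVICE = "2001:db8:ffff::1"      # hypothetical anycast service address
TUNNEL_ROUTER = "2001:db8:2::1"   # hypothetical router that emits the PTB

# Inbound TCP segment of the flow: client -> service, next header 6 (TCP).
flow_hop = ecmp_next_hop(CLIENT, SERVICE, 6, 49152, 443, N_PATHS)

# PTB for that flow: sent by an intermediate router to the service
# address, next header 58 (ICMPv6), no transport ports to hash.
ptb_hop = ecmp_next_hop(TUNNEL_ROUTER, SERVICE, 58, 0, 0, N_PATHS)

print("flow next hop:", flow_hop, "PTB next hop:", ptb_hop)
```

   With N equally likely paths, the two results coincide only by
   chance, roughly once in N cases under this simplified model.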

130	   A typical case where ICMPv6 PTB messages are received at the load
131	   balancer is a case where the path MTU from the client to the load
132	   balancer is limited by a tunnel of which the client itself is not
133	   aware.

135	   Direct experience says that the frequency of PTB messages is small
136	   compared to total flows.  One possible conclusion is that tunneled
137	   IPv6 deployments that cannot carry 1500 MTU packets are relatively
138	   rare.  Techniques employed by clients such as happy-eyeballs may
139	   actually contribute some amelioration to the IPv6 client experience
140	   by preferring IPv4 in cases that might be identified as failures.

142	   Still, the expectation of operators is that PMTUD should work and
143	   that unnecessary breakage of client traffic should be avoided.

145	   A final observation regarding server tuning is that it is not always
146	   possible, even if it is potentially desirable, to independently set
147	   the TCP MSS for different address families on some end-systems.  On
148	   Linux platforms, advmss may be set on a per route basis for selected
149	   destinations in cases where discrimination by route is possible.

152	   The problem as described does also impact IPv4; however
153	   implementation of RFC 4821 [RFC4821] TCP MTU probing, the ability to
154	   fragment on wire at tunnel ingress points and the relative rarity of
155	   sub-1500 byte MTUs that are not coupled to changes in client behavior
156	   (for example, endpoint VPN clients set the tunnel interface MTU
157	   accordingly to avoid fragmentation for performance reasons) make the
158	   problem sufficiently rare that some existing deployments have chosen
159	   to ignore it.

161	3.  Mitigation

163	   Mitigation of the potential for PTB messages to be mis-delivered
164	   involves ensuring that an ICMPv6 error message is distributed to the
165	   same anycast server responsible for the flow for which the error is
166	   generated.  Ideally, mitigation could be done by the mechanism hosts
167	   use to identify the flow, by looking into the payload of the ICMPv6
168	   message (to determine which TCP flow it was associated with) before
169	   making a forwarding decision.  Because the encapsulated IP header
170	   occurs at a fixed offset in the ICMP message it is not outside the
171	   realm of possibility that routers with sufficient header processing
172	   capability could parse that far into the payload.  Employing a
173	   mediation device that handles the parsing and distribution of PTB
174	   messages after policy routing or on each load-balancer/server is a
175	   possibility.
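
   As a rough sketch of what looking into the payload involves, the
   following parses a raw ICMPv6 PTB message and recovers the flow
   identifiers of the quoted packet.  It assumes the RFC 4443 layout (a
   4 byte type/code/checksum, a 4 byte MTU field, then the invoking
   packet) and a quoted packet with no extension headers.  The function
   name and the sample addresses are hypothetical.

```python
import ipaddress
import struct

ICMP6_HDR_LEN = 8   # type, code, checksum, MTU
IPV6_HDR_LEN = 40   # fixed IPv6 header of the quoted packet

def flow_from_ptb(icmp6):
    """Return (mtu, src, dst, next_header, sport, dport) for the packet
    quoted in a PTB message, or None if it cannot be parsed here."""
    if len(icmp6) < ICMP6_HDR_LEN + IPV6_HDR_LEN + 4:
        return None
    icmp_type, _code, _cksum, mtu = struct.unpack("!BBHI", icmp6[:8])
    if icmp_type != 2:                 # not Packet Too Big
        return None
    inner = icmp6[ICMP6_HDR_LEN:]
    next_header = inner[6]
    if next_header not in (6, 17):     # only plain TCP or UDP in this sketch
        return None
    src = ipaddress.IPv6Address(inner[8:24])
    dst = ipaddress.IPv6Address(inner[24:40])
    sport, dport = struct.unpack("!HH", inner[40:44])
    return mtu, str(src), str(dst), next_header, sport, dport

# Build a sample PTB quoting a server -> client TCP segment.
server = ipaddress.IPv6Address("2001:db8:ffff::1")
client = ipaddress.IPv6Address("2001:db8:1::10")
quoted = (struct.pack("!IHBB", 6 << 28, 20, 6, 64)
          + server.packed + client.packed
          + struct.pack("!HH", 443, 49152))
sample_ptb = struct.pack("!BBHI", 2, 0, 0, 1280) + quoted

print(flow_from_ptb(sample_ptb))
```

   Note that the quoted packet is the one sent by the server, so its
   source is the service address and its destination is the client.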

177	   Another mitigation approach is predicated upon distributing the PTB
178	   message to all anycast servers under the assumption that the one for
179	   which the message was intended will be able to match it to the flow
180	   and update the route cache with the new MTU and that devices not able
181	   to match the flow will discard these packets.  Such distribution has
182	   potentially significant implications for resource consumption and for
183	   self-inflicted denial-of-service if not carefully employed.
184	   Fortunately, in real-world deployments we have observed that the
185	   number of flows for which this problem occurs is relatively small
186	   (for example, 10 or fewer pps on 1Gb/s or more worth of HTTPS
187	   traffic); sensible ingress rate limiters which will
188	   discard excessive message volume can be applied to protect even very
189	   large anycast server tiers with the potential for fallout limited to
190	   circumstances of deliberate duress.
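
   One way to picture such a rate limiter is a token bucket, sketched
   below.  This is an illustration of the general technique and not a
   description of any particular router's policer; the class name, the
   rate and the burst size are assumptions.

```python
import time

class Policer:
    """Token-bucket policer: admits up to rate_pps packets per second on
    average, with bursts of at most `burst` packets; the rest are dropped."""

    def __init__(self, rate_pps, burst, clock=time.monotonic):
        self.rate = float(rate_pps)
        self.burst = float(burst)
        self.clock = clock
        self.tokens = float(burst)     # start with a full bucket
        self.last = clock()

    def allow(self):
        """Return True if one more packet may be forwarded now."""
        now = self.clock()
        elapsed = now - self.last
        self.last = now
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# Example with a fake clock so the behaviour is reproducible.
fake_time = [0.0]
policer = Policer(rate_pps=10, burst=5, clock=lambda: fake_time[0])
first_burst = [policer.allow() for _ in range(7)]
fake_time[0] = 0.5                     # half a second later
after_wait = policer.allow()
print(first_burst, after_wait)
```

   PTB messages beyond the configured rate are simply discarded, which
   bounds the replication load on the anycast servers at the cost of
   possibly dropping legitimate messages during an attack.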

192	3.1.  Alternatives

194	   As an alternative, it may be appropriate to lower the TCP MSS to 1220
195	   in order to accommodate a 1280 byte MTU.  We consider this
196	   undesirable because hosts may not be able to independently set TCP
197	   MSS by address-family, thereby impacting IPv4, or alternatively
198	   because middle-boxes need to be employed to clamp the MSS
199	   independently from the end-systems.  Potentially, extension headers
200	   might further alter the lower bound that the MSS would have to be
201	   set to, making clamping still more undesirable.
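
   The 1220 figure follows from subtracting the fixed 40 byte IPv6
   header and the fixed 20 byte TCP header from the 1280 byte minimum
   IPv6 MTU.  The small sketch below shows the arithmetic and how
   extension headers lower the bound further; the function name and
   the 8 byte extension header size are illustrative assumptions.

```python
IPV6_HEADER_BYTES = 40   # fixed IPv6 header
TCP_HEADER_BYTES = 20    # fixed TCP header, options not counted

def tcp_mss_for_mtu(path_mtu, ext_header_bytes=0):
    """Largest TCP MSS that fits in path_mtu over IPv6, given the total
    size of any IPv6 extension headers on the path."""
    return path_mtu - IPV6_HEADER_BYTES - ext_header_bytes - TCP_HEADER_BYTES

print(tcp_mss_for_mtu(1280))      # minimum IPv6 MTU
print(tcp_mss_for_mtu(1500))      # common Ethernet MTU
print(tcp_mss_for_mtu(1280, 8))   # with 8 bytes of extension headers
```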

203	3.2.  Implementation

205	   1.  Filter-based-forwarding matches next-header ICMPv6 type-2 and
206	       matches a next-hop on a particular subnet directly attached to
207	       both border routers.  The filter is policed to reasonable limits
208	       (we chose 1000pps, more conservative rates might be required in
209	       other implementations).

211	   2.  Filter is applied on input side of all external interfaces.

213	   3.  A proxy located at the next-hop forwards ICMPv6 type-2 packets
214	       received at the next-hop to an Ethernet broadcast address
215	       (for example, ff:ff:ff:ff:ff:ff) on all specified subnets.
216	       This was necessitated by router inability (in IPv6) to forward
217	       the same packet to multiple unicast next-hops.

219	   4.  Anycast servers receive the PTB error and process packet as
220	       needed.

222	   A simple Python scapy script that can perform the ICMPv6 proxy
223	   reflection is included.

225	         #!/usr/bin/python

227	         from scapy.all import *

229	         IFACE_OUT = ["p2p1", "p2p2"]

231	         def icmp6_callback(pkt):
232	             if pkt.haslayer(IPv6) and (ICMPv6PacketTooBig in pkt) \
233	             and pkt[Ether].dst != 'ff:ff:ff:ff:ff:ff':
234	                 del(pkt[Ether].src)
235	                 pkt[Ether].dst = 'ff:ff:ff:ff:ff:ff'
236	                 pkt.show()
237	                 for iface in IFACE_OUT:
238	                     sendp(pkt, iface=iface)

240	         def main():
241	             sniff(prn=icmp6_callback, filter="icmp6 \
242	             and (ip6[40+0] == 2)", store=0)

244	         if __name__ == '__main__':
245	             main()

247	   This example script listens on all interfaces for IPv6 PTB errors
248	   being forwarded using filter-based-forwarding.  It removes the
249	   existing Ethernet source and rewrites a new Ethernet destination of
250	   the Ethernet broadcast address.  It then sends the resulting frame
251	   out the p2p1 and p2p2 interfaces, which are attached to the VLANs
252	   where our anycast servers reside.

254	3.2.1.  Alternatives

256	   Alternatively, network designs in which a common layer 2 network
257	   exists on the ECMP hop could distribute the proxy onto the end
258	   systems, eliminating the need for policy routing.  They could then
259	   rewrite the destination -- for example, using iptables before
260	   forwarding the packet back to the network containing all of the
261	   server or load balancer interfaces.  This implementation can be done
262	   entirely within the Linux iptables firewall.  Because of the
263	   distributed nature of the filter, more conservative rate limits are
264	   required than when a global rate limit can be employed.

266	   An example ip6tables / nftables rule to match icmp6 traffic, not
267	   match broadcast traffic, impose a rate limit of 10 pps, and pass to a
268	   target destination would resemble:

270	       ip6tables -I INPUT -i lo -p icmpv6 -m icmpv6 --icmpv6-type 2/0 \
271	       -m pkttype ! --pkt-type broadcast -m limit --limit 10/second \
272	       -j TEE --gateway 2001:DB8::1

274	   As with the scapy example, once the destination has been rewritten
275	   from a hardcoded ND entry to an Ethernet broadcast address -- in this
276	   case to an IPv6 documentation address -- the traffic will be
277	   reflected to all the hosts on the subnet.

279	4.  Improvements

281	   There are several ways that improvements could be made to the
282	   situation with respect to ECMP load balancing of ICMPv6 PTB.

284	   1.  Routers with sufficient capacity within the lookup process could
285	       parse all the way through the L3 or L4 header in the ICMPv6
286	       payload, which begins at bit offset 64 of the ICMPv6 message.  By
287	       reordering the elements of the hash to match the inward direction
288	       of the flow, the PTB error could be directed to the same next-hop
289	       as the incoming packets in the flow.

291	   2.  The FIB (Forwarding Information Base) on the router could be
292	       programmed with a multicast distribution tree that included all
293	       of the necessary next-hops, and ICMPv6 packets could be policy
294	       routed to this destination.

296	   3.  Ubiquitous implementation of RFC 4821 [RFC4821] Packetization
297	       Layer Path MTU Discovery would probably go a long way towards
298	       reducing dependence on ICMPv6 PTB by end systems.
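
   The reordering mentioned in item 1 can be sketched as follows.  The
   packet quoted in the PTB travels from server to client, so swapping
   its addresses and ports reconstructs the key of the inbound
   direction, which then hashes to the same nexthop as the flow itself.
   The hash function, addresses and path count are assumptions for
   illustration only.

```python
import hashlib

def hash_next_hop(flow_key, n_paths):
    """Illustrative ECMP hash over (src, dst, proto, sport, dport)."""
    key = "|".join(str(field) for field in flow_key).encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % n_paths

def inbound_key_from_quoted(quoted_key):
    """Swap the endpoints of the quoted (server -> client) packet to get
    the key that inbound (client -> server) packets hash on."""
    src, dst, proto, sport, dport = quoted_key
    return (dst, src, proto, dport, sport)

N_PATHS = 64   # hypothetical number of next hops

# Key of inbound packets of the flow, as seen by the ECMP router.
inbound = ("2001:db8:1::10", "2001:db8:ffff::1", 6, 49152, 443)
# Key of the packet quoted inside the PTB (sent by the server).
quoted = ("2001:db8:ffff::1", "2001:db8:1::10", 6, 443, 49152)

flow_hop = hash_next_hop(inbound, N_PATHS)
ptb_hop = hash_next_hop(inbound_key_from_quoted(quoted), N_PATHS)
print(flow_hop == ptb_hop)
```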

300	5.  Acknowledgements

302	   The authors would like to thank Marek Majkowski for contributing
303	   text, examples, and a very close review.  The authors would like to
304	   thank Mark Andrews, Brian Carpenter, Nick Hilliard and Ray Hunter
305	   for review.

307	6.  IANA Considerations

309	   This memo includes no request to IANA.

311	7.  Security Considerations

313	   The employed mitigation has the potential to greatly amplify the
314	   impact of a deliberately malicious sending of ICMPv6 PTB messages.
315	   Sensible ingress rate limiting can reduce the potential for impact;
316	   however, legitimate traffic may be lost once the rate limit is
317	   reached.

319	   The proxy replication results in devices not associated with the flow
320	   that generated the PTB being recipients of an ICMPv6 message which
321	   contains a fragment of a packet.  This could arguably result in
322	   information disclosure.  Recipient machines should be in a common
323	   administrative domain.

325	8.  Informative References

327	   [RFC4821]  Mathis, M. and J. Heffner, "Packetization Layer Path MTU
328	              Discovery", RFC 4821, March 2007.

330	Authors' Addresses

332	   Matt Byerly
333	   Fastly
334	   Kapolei, HI
335	   US

337	   Email: suckawha@gmail.com

339	   Matt Hite
340	   Evernote
341	   Redwood City, CA
342	   US

344	   Email: mhite@hotmail.com

346	   Joel Jaeggli
347	   Fastly
348	   Mountain View, CA
349	   US

351	   Email: joelja@gmail.com