[rrg] Fred's IPv4 PMTUD research: RFC1191 support frequently broken

Robin Whittle <rw@firstpr.com.au> Wed, 03 February 2010 01:53 UTC

Message-ID: <4B68D73C.4040902@firstpr.com.au>
Date: Wed, 03 Feb 2010 12:54:04 +1100
From: Robin Whittle <rw@firstpr.com.au>
Organization: First Principles
To: 'RRG' <rrg@irtf.org>
References: <4B5ED682.8000309@firstpr.com.au> <E1829B60731D1740BB7A0626B4FAF0A64950F33198@XCH-NW-01V.nw.nos.boeing.com> <4B5F8E7E.1090301@firstpr.com.au> <E1829B60731D1740BB7A0626B4FAF0A64950F332A8@XCH-NW-01V.nw.nos.boeing.com> <4B5FC783.4030401@firstpr.com.au> <E1829B60731D1740BB7A0626B4FAF0A64950F3333F@XCH-NW-01V.nw.nos.boeing.com> <4B6103C8.6090307@firstpr.com.au> <4B613BD3.5080605@tony.li> <4B616A7C.5010106@firstpr.com.au> <108701caa02a$acc431d0$c2f0200a@cisco.com> <C304DB494AC0C04C87C6A6E2FF5603DB47DBE8D56A@NDJSSCC01.ndc.nasa.gov> <E1829B60731D1740BB7A0626B4FAF0A64950FECA25@XCH-NW-01V.nw.nos.boeing.com>
In-Reply-To: <E1829B60731D1740BB7A0626B4FAF0A64950FECA25@XCH-NW-01V.nw.nos.boeing.com>
Cc: "Eddy, Wesley M. (GRC-MS00)[ASRC AEROSPACE CORP]" <wesley.m.eddy@nasa.gov>
Subject: [rrg] Fred's IPv4 PMTUD research: RFC1191 support frequently broken

  This is a response to Fred's message:

  Re: [rrg] SEAL critique, PMTUD, RFC4821 = vapourware
  http://www.ietf.org/mail-archive/web/rrg/current/msg05907.html

  in which he shows that the failure of RFC1191 PMTUD is far
  more prevalent than I had imagined:

     Some servers sending DF=0 packets (e.g. 1470 bytes).

     Presumed dropping of PTBs in some networks.

     Some servers failing to respond properly or at all to PTBs.

     Perhaps tunnels which do not generate PTBs to the sending
     host (which may be the entry router of an outer tunnel).

  It seems the Internet is getting by with dodgy assumptions about
  packet sizes which enable everything to work (mostly) despite
  these failings.

  I can't see how we could transition to jumboframes in the DFZ
  with the current messy situation.

  Does this justify retreat into the development of RFC4821
  packetization layer PMTUD?  That would seem to be a capitulation
  to the bad practices which have undermined RFC1191 PMTUD,
  allowing these practices to continue and become more prevalent.
  Yet RFC4821 would only solve the problem once all stacks
  and all applications were modified accordingly.

  How can an encapsulation-based CES architecture cope with
  these problems, assuming the encapsulation overhead leads
  to PMTU problems between the ITR and ETR which are not
  properly handled with a PTB message?  It can be done, but
  it is complex - relying on ITR-to-ETR checking that
  packets actually arrive at the ETR, rather than relying on
  PTBs.
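
  (As a very rough illustration of what I mean by ITR-to-ETR checking
  - not any particular proposal's mechanism - here is a sketch in
  Python.  The send_encap() and etr_confirmed() helpers, and the
  starting sizes, are hypothetical stand-ins for whatever signalling
  a real CES protocol would provide.)

     # Hypothetical sketch: the ITR only trusts packet sizes whose
     # delivery the ETR has confirmed, instead of waiting for PTBs
     # which may never arrive.

     def forward(packet, state, send_encap, etr_confirmed):
         size = len(packet)
         if size <= state["known_good"]:
             send_encap(packet)                 # size already proven to arrive
             return
         send_encap(packet, request_ack=True)   # borderline size: ask ETR to confirm
         if etr_confirmed(size, timeout=1.0):
             state["known_good"] = size         # delivery proven at this size
         else:
             state["too_big"] = size            # assume dropped: packets this long
                                                # must in future be split or refused

     # e.g. one state dict per ETR: {"known_good": 1200, "too_big": None}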

  IPv6 would probably have similar problems to IPv4 - and perhaps
  more tunnels at present.


Hi Fred,

Replying to Wesley Eddy's message (msg05860), you wrote:

>>>> Where is the evidence that PTB filtering is ever more than a
>>>> transitory, mistaken, condition?
>>> Sounds like you want a research paper?
>>
>> Here are three for Robin's sake; this issue is well known to the
>> transport layer folks:
>>
>> http://www.icir.org/tbit/tbit-Aug2004.pdf
>>
>> http://www.caida.org/workshops/wide/0603/slides/mluckie2.pdf
>>
>> http://listserver.internetnz.net.nz/pipermail/ipv6-techsig/2009-October/000708.html
> 
> Thanks for pointers to this work. Robin pointed out some
> of this on the RRG list earlier, but it's good to have the
> additional material.
> 
> From Ben Stasiewicz' post on the ipv6-techsig mailing list
> (above), I retrieved the Alexa list of the top 1M websites
> and from this took the top 1000 as my sample set. I used
> the "tbit" tool (http://icir.net/tbit) to run IPv4 Path MTU
> discovery (PMTUD) tests on each of the top 1000 using the
> command syntax:
> 
>   # tbit -m 1460 -M 576 -n www.example.com -t PMTUD example.com  
> 
> This command has tbit advertise an MSS of 1460 during the
> TCP connection, but whenever tbit receives a packet larger
> than 576 from a website it drops the packet and sends back
> an ICMPv4 "Fragmentation Needed" (aka "PMTUD") message.
> 
> The test was to determine how each website in the top 1000
> responded to PMTUD messages. Of the websites sampled, I
> observed the following results:
> 
>   Transfer Fail:   249 (24.9%)
>   PMTUD Disabled:  109 (10.9%)
>   PMTUD Success:   394 (39.4%)
>   PMTUD Fail:      196 (19.6%)
>   Connection Fail:  52 ( 5.2%)  
> 
> The "Transfer Fail" class was mostly due to HTTP 301/302
> responses (tbit did not know how to handle these), and in
> those cases tbit quit before any PMTUD tests were run. I
> otherwise did not make any effort to diagnose the failures
> more closely.
> 
> The "PMTUD Disabled" class included websites that set DF=0
> in the IPv4 header, and thereby disabled PMTUD making data
> packets eligible for fragmentation in the network.

Google did this the last time I looked, with 1470 byte packets:

http://www.firstpr.com.au/ip/ivip/ipv4-bits/actual-packets.html#google-no-pmtud

Trying again, using a Google logo image file, from a US server:

  tcpdump -n -x  -i eth0 > tcpdump-01.txt
  wget http://74.125.159.105/images/google_sm.gif

    IP 74.125.159.105.http > myhost.48611: ....

    0x0000:  4500 05be 7a19 0000 3806 c132 4a7d 9f69
    0x0010:  ae8f a978 0050 bde3 22f2 47ec d1c4 152f
    0x0020:  8010 006a 7ad2 0000 0101 080a 2570 5989

    0x0000:  4500 05be 7a19 0000
                  ----      ^
           1470 bytes        \ DF=0

So the same thing is still occurring: 1470 byte packets sent with DF=0.
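
(For anyone who wants to check that reading of the bytes, here is a
short decode in Python, using the hex copied from the dump above.)

  # Decode the relevant IPv4 header fields from the tcpdump line above.
  hdr = bytes.fromhex("4500 05be 7a19 0000 3806 c132")   # fromhex() ignores spaces

  total_length = int.from_bytes(hdr[2:4], "big")         # bytes 2-3: Total Length
  df_bit = (int.from_bytes(hdr[6:8], "big") >> 14) & 1   # DF is the 0x4000 flag bit

  print(total_length, df_bit)   # prints "1470 0": a 1470 byte packet with DF=0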


> The "PMTUD Success" cases correctly implemented PMTUD, but
> in many cases tbit needed to send multiple PMTUD messages
> before the website reduced the size of its data packets.
> In some cases, websites responded to the PMTUD messages in
> strange ways, including making curious guesses at packet
> sizes instead of honoring the MTU size advertised in the
> PMTUD message, turning PMTUD off altogether after receiving
> a few PMTUD messages, etc.
> 
> The "PMTUD Failed" cases represented true failures of the
> website to correctly honor the PMTUD messages, as verified
> by wireshark captures. This could only happen if the PMTUD
> messages were filtered in the network and not delivered to
> the website, or if the website received the PMTUD messages
> and ignored them.
> 
> The "Connection Fail" cases represented websites that
> were unreachable, but may have also included sites that
> sent large packets that were silently dropped due to a
> true MTU bottleneck before reaching the test machine.
> 
> All of this closely agrees with the results documented
> in "Measuring the Evolution of Transport Protocols in
> the Internet" (Medina, Allman, and Floyd, ACM Computer
> Communication Review, April 2005):
> 
>  http://icir.net/tbit/TCPevolution-Mar2005.pdf
> 
> The end result is that the IPv4 Internet still seems
> to be vulnerable to PMTUD failures along paths that
> lead to the top websites in the Internet today. This
> suggests that PMTUD is likely not being invoked very
> often in the first place since most links in the Internet
> configure an MTU of 1500 or larger and few applications
> send packets larger than that. However, as more and more
> tunnels over the IPv4 Internet are used the frequency
> of PMTUD failures is likely to increase as well.

Thanks for doing this research.  It indicates that PMTUD is a mess in
the IPv4 Internet - and yet in general things seem to work OK.

As long as almost all hosts are reachable with PMTUs of 1500 or
somewhat below, and as long as hosts which are not using RFC1191
PMTUD generally do not attempt to send packets longer than about
1470 bytes, I guess things generally work OK.

But this presents real difficulties for introducing any CES
architecture which uses encapsulation, assuming that these problems
with PTBs being dropped, or not generated in tunnels, continue and
affect the paths between ITRs and ETRs.
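
(For reference, the PTB which all of this hinges on is just an ICMPv4
type 3, code 4 "Fragmentation Needed" message carrying a Next-Hop MTU
field and the start of the offending packet - RFC 792 plus RFC 1191.
A purely illustrative way of building one in Python, assuming the
inner IPv4 header is 20 bytes with no options:)

  import struct

  def icmp_checksum(data):
      # Standard Internet checksum over the ICMP message.
      if len(data) % 2:
          data += b"\x00"
      total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
      total = (total & 0xFFFF) + (total >> 16)
      total = (total & 0xFFFF) + (total >> 16)
      return ~total & 0xFFFF

  def build_ptb(next_hop_mtu, offending_packet):
      # Type 3, code 4, checksum placeholder, 2 unused bytes, Next-Hop MTU,
      # then the original IP header (assumed 20 bytes) + 8 bytes of TCP/UDP.
      payload = offending_packet[:28]
      msg = struct.pack("!BBHHH", 3, 4, 0, 0, next_hop_mtu) + payload
      return msg[:2] + struct.pack("!H", icmp_checksum(msg)) + msg[4:]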

Hosts sending 1470 byte DF=0 packets are a much tougher nut to crack,
since a CES ITR simply cannot tunnel such a packet in a single packet
once the encapsulation overhead is added.  The ITR can't tell the host
to send shorter packets, because with DF=0 the host has opted out of
PMTUD.  So each such packet would have to travel as two packets in the
ITR-to-ETR tunnel.  This would be extremely ugly.
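
(To put rough numbers on it - assuming, purely for illustration, 36
bytes of encapsulation overhead (outer IPv4 + UDP + an 8 byte
map-encap header) and a 1500 byte path between ITR and ETR:)

  # Illustrative arithmetic only - the overhead and path MTU are assumptions.
  PATH_MTU  = 1500
  OVERHEAD  = 20 + 8 + 8            # outer IPv4 + UDP + encap header
  INNER_LEN = 1470                  # the DF=0 packet as sent by the host
  INNER_HDR = 20                    # inner IPv4 header, assumed no options

  room = PATH_MTU - OVERHEAD - INNER_HDR   # data room in the first inner fragment
  room -= room % 8                         # fragment offsets are in 8 byte units
  data = INNER_LEN - INNER_HDR             # 1450 bytes of TCP data to carry

  frag1_wire = OVERHEAD + INNER_HDR + room            # 1496 bytes on the wire
  frag2_wire = OVERHEAD + INNER_HDR + (data - room)   # 66 bytes on the wire
  print(frag1_wire, frag2_wire)   # one near-full packet plus one tiny one,
                                  # for every single 1470 byte DF=0 packet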

If a CES scheme was introduced, it would probably need to be
accompanied by a BCP (and a pep-talk to Google) to the effect that it
really is not OK any more to send DF=0 packets which are longer than
can be handled by the encapsulation scheme.

In the long term, I would prefer to see DF=0 packets no longer used -
and these problems of PTBs being dropped, or not generated in
tunnels, or ignored, fixed - rather than every host and application
having to use the messy and potentially error-prone RFC4821 approach
to PMTUD.
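
(For anyone not familiar with it, the RFC4821 idea is roughly the loop
below: the packetization layer - TCP, say - probes with real,
acknowledged packets of candidate sizes and never depends on PTBs at
all.  This is only a sketch of the search; the RFC is considerably
more involved, which is part of why I call it messy.)

  # Sketch of an RFC4821-style packetization layer MTU search.  send_probe()
  # and acked() are hypothetical hooks into the transport; real PLPMTUD also
  # has to separate probe loss from congestion loss, cache results, etc.

  def plpmtud_search(send_probe, acked, floor=1024, ceiling=9000):
      pmtu = floor                   # start from a size assumed to be safe
      lo, hi = floor, ceiling
      while lo <= hi:
          size = (lo + hi) // 2
          send_probe(size)           # a real data packet padded to 'size' bytes
          if acked(size, timeout=2.0):
              pmtu = size            # delivered end to end at this size
              lo = size + 1
          else:
              hi = size - 1          # treat loss as "too big" (may be wrong)
      return pmtu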

  - Robin