Re: [rrg] SEAL critique, PMTUD, RFC4821 = vapourware

Robin Whittle <rw@firstpr.com.au> Wed, 03 February 2010 14:16 UTC

Short version:    Refining my understanding of Fred's SEAL
                  protocol for tunneling with PMTUD management.

Hi Fred,

I will include only parts of my previous message to which you are
replying.

>>>> IPTM doesn't rely on the PTB. (See below for how it will be able to
>>>> work with minimal length IPv4 PTBs.)  As long as a PTB does get to
>>>> the ITR - which it would in most cases - then the ITR knows about the
>>>> MTU problem without having to wait for the ETR to time out and send a
>>>> message to the ITR saying the big packet did not arrive.  Also, the
>>>> ITR gets an exact MTU value from this PTB, rather than having to do
>>>> what SEAL does - hunt back and forth to find a packet size which is
>>>> reliably delivered without MTU problems.

>>> SEAL doesn't hunt back and forth.

>> 4.3.9.1.2 mentions an "iterative searching strategy" - which sounds
>> like a fancy term for "hunt"!  This occurs only in IPv4: when the
>> ETE gets a first fragment shorter than 576 bytes, it is interpreted
>> as a "runt fragment" and so is not regarded as a true measurement
>> of the limiting MTU.
> 
> Yes, this follows directly from the RFC1191 "plateau
> table" approach to guessing MTUs when old routers fail
> to fill in the MTU field in their PTB messages. 

OK - an iterative search strategy which only works downwards is not
"hunting"!

Are there really routers still in use which don't fill in the 16 bit
Next-Hop MTU field?

Even if there are none, SEAL still has to cope with the
possibility of zero or unreasonable values in this MTU field.
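As a rough illustration (not taken from SEAL or IPTM), here is a minimal Python sketch of reading the Next-Hop MTU from an IPv4 "Fragmentation Needed" message (ICMP type 3, code 4; RFC 1191 puts the 16 bit MTU in the low half of the second 32 bit word of the ICMP header) and rejecting zero or implausible values. The function name, and the assumption that a usable value must be at least 68 and smaller than the packet that triggered the message, are illustrative choices only.

```python
import struct
from typing import Optional

IPV4_MIN_MTU = 68  # minimum IPv4 MTU per RFC 791

def ptb_next_hop_mtu(icmp: bytes, sent_len: int) -> Optional[int]:
    """Return the Next-Hop MTU from an ICMPv4 Fragmentation Needed message,
    or None if this is not such a message or the value is unusable."""
    if len(icmp) < 8:
        return None
    icmp_type, code, _checksum, _unused, mtu = struct.unpack("!BBHHH", icmp[:8])
    if icmp_type != 3 or code != 4:
        return None  # not "Fragmentation Needed and DF set"
    # Zero means an old router did not fill the field in; a value at or above
    # the size of the packet we sent cannot be the limiting MTU.
    if mtu < IPV4_MIN_MTU or mtu >= sent_len:
        return None
    return mtu
```

A None result is the case where some fallback, such as the RFC 1191 plateau table, is needed.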

> What I
> meant was that yes, SEAL will hunt "back" when there is
> a router in the path that does weird fragmentation. But,
> it does not hunt "forth".

OK - I understand that SEAL follows "an iterative searching strategy
that parallels (Section 5 of RFC1191)":

  http://tools.ietf.org/html/rfc1191#section-5

presumably the final part of this, which leads to:

  http://tools.ietf.org/html/rfc1191#section-7

I understand this means stepping downwards through the plateau values
in Table 7-1.  However, I guess you would choose some new values to
add to this list.
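Stepping downwards through the plateaus can be sketched in a few lines of Python. This uses only the plateau values given in RFC 1191 Table 7-1; whatever extra values SEAL might add are not shown, and the function name is hypothetical.

```python
# Plateau values from RFC 1191, Table 7-1, in descending order.
RFC1191_PLATEAUS = (65535, 32000, 17914, 8166, 4352, 2002, 1492, 1006, 508, 296, 68)

def next_lower_plateau(failed_size: int) -> int:
    """Return the largest plateau strictly below failed_size.

    The search only ever moves down; 68 (the IPv4 minimum) is the floor."""
    for plateau in RFC1191_PLATEAUS:
        if plateau < failed_size:
            return plateau
    return RFC1191_PLATEAUS[-1]
```

For example, repeated failures starting from a 1500 byte packet would try 1492, then 1006, then 508.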



>> In this application of SEAL, I understand there is no need for any
>> mid-layer protocol between the IPv4 or IPv6 header and the SEAL
>> header, or between the SEAL header and the traffic packet.  This is
>> not clearly specified anywhere, since the SEAL and RANGER documents
>> are general purpose, and their use for a scalable routing solution as
>> a Core-Edge Separation architecture is only one thing they could be
>> used for.
> 
> In some environments, it may be necessary to insert a
> mid-layer UDP header in order to give ECMP/LAG routers
> a handle to support multipath traffic flow separation.

  http://en.wikipedia.org/wiki/Equal-cost_multi-path_routing

  http://www.force10networks.com/CSPortal20/TechTips/0065_HowDoIConfigureLoadBalancing.aspx

As far as I know, these techniques are not something to consider with
the RANGER CES, or with LISP or Ivip.  If the routers can handle
ordinary traffic packets they can handle encapsulated packets too.  I
haven't read about these techniques in detail.  I guess that within
RANGER, beyond its use as a CES scalable routing solution, you may
want to support ECMP and LAG.
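The reason a mid-layer UDP header helps ECMP/LAG routers can be sketched as follows. Without it, every encapsulated packet between one ITE and one ETE carries the same outer addresses and protocol, so a router hashing only outer fields sends them all down one path. Deriving the outer UDP source port from a hash of the inner flow restores some variety. This is a minimal Python sketch under assumed choices: the CRC-32 hash, the 49152-65535 port range, the UDP destination port used in the example and all function names are hypothetical, not taken from SEAL or RANGER.

```python
import zlib

def flow_key(*fields) -> bytes:
    """Serialise header fields (e.g. a 5-tuple) into bytes for hashing."""
    return "|".join(str(f) for f in fields).encode()

def outer_udp_source_port(inner_src, inner_dst, proto, sport, dport) -> int:
    """Derive the mid-layer UDP source port from the inner flow.

    Packets of one inner flow always get the same port (no reordering);
    different inner flows usually get different ports."""
    key = flow_key(inner_src, inner_dst, proto, sport, dport)
    return 49152 + zlib.crc32(key) % 16384  # assumed range 49152-65535

def ecmp_next_hop(outer_src, outer_dst, udp_sport, udp_dport, n_paths: int) -> int:
    """Toy ECMP router: hash the outer 5-tuple (protocol 17 = UDP) onto a path."""
    return zlib.crc32(flow_key(outer_src, outer_dst, 17, udp_sport, udp_dport)) % n_paths
```

Hashing per flow rather than per packet keeps each inner flow on one path, which avoids reordering within a flow.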



>> Firstly I describe my understanding of what your ID specifies for IPv4.
>>
>> Secondly I describe two other ways you might do PMTUD with IPv4,
>> without using DF=0 packets.  These would avoid whatever risk there
>> might be of setting the ITE's PMTU estimate too low due to a limiting
>> router sending out fragments which are shorter than the limiting next
>> hop MTU.
>>
>> Finally, I describe my understanding of what your ID specifies for IPv6.
>>
>> This is partly for my own reference, since it took me many hours to
>> discern this by reading the SEAL ID and corresponding with you.
>>
>>
>> IPv4:
>>
>>   The ITE sends a DF=0 packet into the tunnel.  This starts with
>>   an IPv4 header, then has a SEAL header (there's no mid-level
>>   protocol in this Core-Edge Separation usage of RANGER and SEAL)
>>   and then the inner packet, the original IPv4 traffic packet.
>>
>>   The source address in the outer header is that of the ITE and
>>   the destination address is that of the ETE.  The 32 bit
>>   SEAL ID is split in two.  16 bits go into the IPv4 header's
>>   ID field and 16 into the SEAL header's ID Extension field.
>>
>>   The limiting router in the tunnel (the one where the next-hop
>>   MTU is less than the length of this whole packet) fragments
>>   it into at least two fragments.
>>
>>   Now the second para in 4.4.2 comes into play:
>>
>>         When the ETE processes the IP first-fragment (i.e.,
>>         one with MF=1 and Offset=0 in the IP header) of a
>>         fragmented SEAL packet, ...
>>
>>   The first para was for reassembling packets which had been
>>   fragmented by the SEAL protocol.  But the second para is for
>>   SEAL packets, as was just sent, being fragmented by a
>>   router between the ITE and the ETE.  This only occurs for
>>   IPv4 and I think it would be helpful to mention IPv4 in this
>>   paragraph.  Maybe it needs its own section.
>>
>>        ...  it sends a "Reassembly Report - Fragmentation
>>        Experienced" message back to the ITE with the S_MSS field
>>        set to the length of the first-fragment and with the
>>        S_MRU field set to no more than the size of the reassembly
>>        buffer (see Section 4.4.5).
>>
>>   I think this last part about the value of S_MRU is not clear
>>   enough.  What value should it be set to?
>>
>>   I will assume it is set to some non-zero value.
>>
>>   Assuming the limiting router sent out the first fragment with
>>   a length equal to the limiting next-hop MTU, then this MTU
>>   value is now in the S_MSS field of the message sent to the
>>   ITE.
>>
>>
>>   This message arrives at the ITE.  This message, according to
>>   Figure 4, contains:
>>
>>        As much of invoking packet as possible without the
>>        message exceeding 576 bytes.
>>
>>   Maybe your ID specifies this, but I am having trouble
>>   following it - there has to be a way the ITE securely
>>   accepts this "Fragmentation Experienced" message.
>>
>>   As far as I know, the ITE looks into the message, finding
>>   the initial part of the packet which the ETE received as
>>   a first fragment.  That will contain the outer IPv4 header
>>   and the SEAL header, and from these the 32 bit SEAL ID
>>   in the SEAL encapsulated packet can be found.
>>
>>   I think you either cache the recently sent 32 bit SEAL IDs
>>   or maintain a sliding window function over their range so
>>   you can easily identify a value which was used in the last
>>   second or two.  In a given ITE, each ETE has its own SEAL
>>   ID counter.  Its value is initialized randomly when the
>>   state for this ETE is created.  After that, its value
>>   increments with each packet sent to the ETE.
>>
>>   (I may adopt this incrementing value per ETR arrangement,
>>   with its sliding window, rather than using a nonce.)
>>
>>   The wider the window in time, the longer you can accept these
>>   messages.  Since the ETE and the ITE could be on opposite
>>   sides of the Net, I guess you need to have a window which
>>   accepts SEAL IDs sent at least a second ago.
>>
>>   The longer the window in time, and the more packets the
>>   ITE sends to this ETE, the wider the window is numerically
>>   and the easier it is for an attacker to guess a valid value
>>   and have the ITE accept a PTB with a low enough value
>>   to cause lost efficiency - for the next 10 minutes or so.
> 
> The above is all correct wrt the window management. The
> ITE can ensure that the window size remains bounded by
> sending periodic explicit probes (e.g., one explicit
> probe per every N data packets).

I don't have a really clear idea of how SEAL sets the numeric window,
or what you mean by your second sentence above.  If you can give a
more complete description with an example, I would really appreciate
it - in part because I might want to use or adapt your technique for
Ivip, which currently uses nonces.
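The per-ETE counter and acceptance window described in the quoted text above can be sketched in Python as follows. This is one reading of that description, not SEAL's specification. Several details are assumptions: which half of the 32 bit SEAL ID goes in the IPv4 ID field and which in the ID Extension field, the class and method names, and the fixed window size. The window also counts packets rather than seconds.

```python
import secrets
from typing import Tuple

MOD = 1 << 32  # SEAL IDs are 32 bits and wrap around

def split_seal_id(seal_id: int) -> Tuple[int, int]:
    """Assumed layout: high 16 bits -> IPv4 ID, low 16 bits -> SEAL ID Extension."""
    return (seal_id >> 16) & 0xFFFF, seal_id & 0xFFFF

def join_seal_id(ipv4_id: int, id_extension: int) -> int:
    """Rebuild the 32 bit SEAL ID from the header fields in a returned packet."""
    return ((ipv4_id & 0xFFFF) << 16) | (id_extension & 0xFFFF)

class PerEteIdState:
    """Hypothetical per-ETE soft state held by an ITE."""

    def __init__(self, window: int = 4096) -> None:
        self.next_id = secrets.randbelow(MOD)  # random start when state is created
        self.window = window                   # how many recent IDs are still accepted

    def allocate(self) -> int:
        """Return the SEAL ID for the next packet to this ETE, then increment."""
        seal_id = self.next_id
        self.next_id = (self.next_id + 1) % MOD
        return seal_id

    def accepts(self, seal_id: int) -> bool:
        """True if seal_id was among the last `window` IDs sent to this ETE."""
        age = (self.next_id - seal_id) % MOD  # 1 = most recently sent ID
        return 1 <= age <= self.window
```

Because this window counts packets, an ITE sending many packets per second to one ETE needs a larger count to cover a long round trip, and that widens the range an attacker could guess. The default of 4096 is an arbitrary placeholder.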



>>   Now to 4.3.9.1.2:
>>
>>        4.3.9.1.2. Fragmentation Experienced (Code=1)
>>
>>        If the value in the S_MRU field is non-zero, the
>>        ITE records the value in its soft state for this ETE.
>>
>>   This means this value is stored in the S_MRU variable for
>>   this ETE, as defined in 4.3.3.  As noted above, I am not
>>   clear on what value was written into this field of the
>>   report by the ETE.
>>
>>        The ITE then adjusts the S_MSS value in its soft state.
>>
>>   This means this value is stored in the S_MSS variable for
>>   this ETE, as defined in 4.3.3, subject to the instructions
>>   in the next few sentences.
>>
>>   I am a bit confused about the differing roles of these two
>>   variables.
>>
>>        If the S_MSS value in the Reassembly Report is greater
>>        than 576 (i.e., the nominal minimum MTU for IPv4 links),
>>        the ITE records this new value in its soft state.
>>
>>   OK - this is based on the assumption that the length of the
>>   first fragment received by the ETE reflects the limiting
>>   MTU of the ITE to ETE path.
>>
>>        If the S_MSS value in the report is less than the current
>>        soft state value and also less than 576,
>>
>>   How could the ITE's S_MSS value for this ETE be less than
>>   576?  I can't see how.  If it can't be, then the first part
>>   of the above sentence may be redundant.
> 
> No, the sentence is correct. It is possible for the ITE
> to need to reduce its cached S_MSS value to a size less
> than 576 if there is truly a link with a small MTU (e.g.,
> 256) on the path. Although 576 is often considered to
> be the "nominal" minimum MTU for IPv4 links, the actual
> minimum MTU is only 68 bytes per RFC791.

OK - I guess I wouldn't go this far for Ivip.  I will probably have
some guidance that if anyone has routers, tunnels or whatever with
PMTUs less than 1280 or some figure in that range, for either IPv4 or
IPv6, then they should make sure that ITRs or ETRs are not located so
that such low PMTU parts of the network are between the DFZ and these
ITRs or ETRs.
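The IPv4 "Fragmentation Experienced" exchange quoted above (ETE side from 4.4.2, ITE side from 4.3.9.1.2) can be sketched end to end in Python. The quoted sentence about reports below 576 is cut off before its action. This sketch assumes the ITE then steps down one RFC 1191 plateau from its current estimate, following the earlier remark that such fragments are treated as "runts" and handled by an iterative search. That step, the 1500 byte starting estimate and all names are assumptions. A report of exactly 576 is left unhandled, as in the quoted text.

```python
import struct
from dataclasses import dataclass
from typing import Optional

# RFC 1191 Table 7-1 plateaus, used here for the assumed "runt" case.
PLATEAUS = (65535, 32000, 17914, 8166, 4352, 2002, 1492, 1006, 508, 296, 68)
IPV4_NOMINAL_MIN_MTU = 576

def is_first_fragment(ipv4_packet: bytes) -> bool:
    """True if the IPv4 header has MF=1 and Fragment Offset=0."""
    (flags_and_offset,) = struct.unpack("!H", ipv4_packet[6:8])
    more_fragments = bool(flags_and_offset & 0x2000)
    return more_fragments and (flags_and_offset & 0x1FFF) == 0

@dataclass
class FragExperiencedReport:
    s_mss: int  # length of the first fragment the ETE received
    s_mru: int  # no more than the ETE's reassembly buffer size; 0 = not given

def ete_on_packet(packet: bytes, reassembly_buffer: int) -> Optional[FragExperiencedReport]:
    """ETE side: report only when a SEAL packet arrives as a first fragment."""
    if not is_first_fragment(packet):
        return None
    return FragExperiencedReport(s_mss=len(packet), s_mru=reassembly_buffer)

@dataclass
class IteSoftState:
    s_mss: int = 1500  # hypothetical starting estimate
    s_mru: int = 0

    def on_report(self, rpt: FragExperiencedReport) -> None:
        """ITE side: the update rules from 4.3.9.1.2 as quoted above."""
        if rpt.s_mru != 0:
            self.s_mru = rpt.s_mru
        if rpt.s_mss > IPV4_NOMINAL_MIN_MTU:
            self.s_mss = rpt.s_mss  # trust the first-fragment length
        elif rpt.s_mss < IPV4_NOMINAL_MIN_MTU and rpt.s_mss < self.s_mss:
            # Possible runt fragment: step down one plateau instead of
            # trusting the fragment length (assumed reading of the draft).
            self.s_mss = next((p for p in PLATEAUS if p < self.s_mss), PLATEAUS[-1])
```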



>> OK - but at some point we need to stop adopting band-aid measures
>> like artificially limiting MSS or MTU values.  That just lets the PTB
>> filtering and lousy tunnels be less noticed.  We should not be
>> trying to upgrade the stacks of all hosts in the world because a few
>> end-user networks filter PTBs or ISPs and perhaps end-user networks
>> run tunnels which don't support the otherwise perfectly good RFC 1191
>> / 1981 PMTUD techniques.
>>
>> We would just be heaping limitations and complications on ourselves
>> in an overly-defensive, expensive and inefficient attempt to cope
>> with failure of a few ISPs and end-user networks to run the Internet
>> as it needs to be run.  We are paying the ISPs.  The end-user
>> networks which are filtering PTBs are disrupting a subset of their
>> own communications.
> 
> I still think there are problems out there. I will post
> another message on this soon.

Indeed you did - I had no idea things were this bad.

  http://www.ietf.org/mail-archive/web/rrg/current/msg05907.html
  http://www.ietf.org/mail-archive/web/rrg/current/msg05910.html

>> I just think it wrong in principle to develop messy new protocols
>> such as RFC 4821 to cope with these failings.
> 
> In my opinion, packetization layers are operating "at risk"
> if they use packet sizes larger than 1500 but are not in
> some way checking with the final destination to ensure that
> the big packets are actually getting through. RFC4821 is
> a method for the source to do just that without requiring
> any changes on the destination. But to be sure, SEAL does
> not *depend* on RFC4821 but rather *sets the stage* for
> RFC4821 and/or any functional equivalents.

OK.  I need to revisit all my thinking on PMTUD given the results of
your recent research.  But I haven't yet changed my thinking that RFC
4821 is messy and expensive, and that it would be better to
straighten out whatever it is within the network which is stopping
PMTUD from working correctly.
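For comparison, the core of RFC 4821 packetization layer PMTUD is a search. The sender probes with sizes between a size known to get through and a size assumed not to, and moves those bounds according to whether each probe is acknowledged by the far end. The Python sketch below uses a plain binary search. Its two bounds play roughly the role of RFC 4821's search_low and search_high, though here the upper bound is exclusive. The probe callback is hypothetical, and retries, timers and how loss is actually detected are left out.

```python
from typing import Callable

def plpmtud_binary_search(probe: Callable[[int], bool], known_good: int, known_bad: int) -> int:
    """Largest packet size that gets through, found by binary search.

    known_good: a size already confirmed to reach the destination.
    known_bad:  a size assumed not to (an exclusive upper bound).
    probe(size) returns True if a probe of that size was acknowledged."""
    while known_bad - known_good > 1:
        size = (known_good + known_bad) // 2
        if probe(size):
            known_good = size  # acknowledged: raise the floor
        else:
            known_bad = size   # lost: lower the ceiling
    return known_good
```

Each probe costs at least a round trip, and searching from 1280 up to 9000 takes roughly a dozen probes.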

 - Robin