Re: [rrg] SEAL critique, PMTUD, RFC4821 = vapourware

"Templin, Fred L" <Fred.L.Templin@boeing.com> Tue, 02 February 2010 22:08 UTC

From: "Templin, Fred L" <Fred.L.Templin@boeing.com>
To: Robin Whittle <rw@firstpr.com.au>, RRG <rrg@irtf.org>
Date: Tue, 02 Feb 2010 14:09:17 -0800

Hi Robin,

> -----Original Message-----
> From: Robin Whittle [mailto:rw@firstpr.com.au]
> Sent: Saturday, January 30, 2010 10:01 AM
> To: RRG
> Cc: Templin, Fred L
> Subject: Re: [rrg] SEAL critique, PMTUD, RFC4821 = vapourware
>
> Short version:     Mainly explaining my understanding of how
>                    SEAL does PMTUD when used as part of RANGER
>                    - when RANGER is acting as a Core-Edge
>                    Separation solution to the routing scaling
>                    problem.
>
>                    Both RANGER and SEAL can do other things
>                    besides this, and I find it hard to envisage
>                    the subset of their operations which would
>                    be used in a Core-Edge Separation setting.
>
>                    I still think it is wrong to develop a complex
>                    and difficult protocol such as RFC4821,
>                    because of some badly designed tunnels which
>                    don't generate PTBs or due to a few networks
>                    which filter out PTBs.
>
>                    There's very little sign of anyone wanting to
>                    develop or use RFC4821 - so I guess the
>                    need or desire for it isn't very strong.
>
>
>
> http://tools.ietf.org/html/draft-templin-intarea-seal-08
>
>
> Hi Fred,
>
> Thanks for your reply:
>
> >> So I disagree with your statement except to the extent of
> >> applications which do no packetization and just rely on the stack's
> >> TCP (or SCTP or whatever) packetization layers which may or may not
> >> implement RFC4821.
> >
> > Yes; thanks for the correction. RFC4821 concerns packetization
> > layer path MTU discovery, where the application itself is a
> > packetization layer when, e.g., UDP is used as the transport.
>
> OK.  My understanding is that in the stack there is, for each
> destination address, an MTU number which can be read and written by
> any packetization layer.  The TCP packetization layer in the stack is
> the most obvious one, but I guess if there was an SCTP layer there
> too, this would also read and write this variable (or is it multiple
> variables for each destination address?).  Then, for as many
> applications as have packetization layers, they would be able to read
> and write these variables too.
>
>
> >> When IPTM sends a big packet (containing most of a traffic packet) as
> >> part of its PMTU probing, if this hits an MTU limit, then it is
> >> dropped, with a short PTB going back to the ITR.  With SEAL, in IPv4
> >> mode, the limiting router has to split the packet up and forward it,
> >> so other routers and the ETR have to carry all the fragments.  So my
> >> IPTM approach is arguably lighter weight than your SEAL approach.
> >
> > With SEAL-FS, the ETE does not carry all the fragments.
> > Instead, it uses first-fragments for reporting purposes
> > only and otherwise discards all fragments.
>
> OK - thanks.
>
>
> >> IPTM doesn't rely on the PTB. (See below for how it will be able to
> >> work with minimal length IPv4 PTBs.)  As long as a PTB does get to
> >> the ITR - which it would in most cases - then the ITR knows about the
> >> MTU problem without having to wait for the ETR to time out and send a
> >> message to the ITR saying the big packet did not arrive.  Also, the
> >> ITR gets an exact MTU value from this PTB, rather than having to do
> >> what SEAL does - hunt back and forth to find a packet size which is
> >> reliably delivered without MTU problems.
> >
> > SEAL doesn't hunt back and forth.
>
> 4.3.9.1.2 mentions an "iterative searching strategy" - which sounds
> like a fancy term for "hunt"!  This occurs only in IPv4, when the ETE
> gets a first fragment shorter than 576 bytes: this is then
> interpreted as a "runt fragment" and so is not regarded as a true
> measurement of the limiting MTU.

Yes, this follows directly from the RFC1191 "plateau
table" approach to guessing MTUs when old routers fail
to fill in the MTU field in their PTB messages. What I
meant was that yes, SEAL will hunt "back" when there is
a router in the path that does weird fragmentation. But,
it does not hunt "forth".
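
As a rough illustration of hunting "back" only, here is a minimal
Python sketch. It assumes the ITE steps down through the RFC1191
plateau values (Table 7-1); the SEAL draft only points at RFC1191
section 5, so the exact strategy is an assumption, and the names
next_lower_plateau, hunt_back and path_ok are hypothetical.

```python
# Plateau values from RFC1191, Table 7-1 (bytes), largest first.
PLATEAUS = [65535, 32000, 17914, 8166, 4352, 2002, 1492, 1006, 508, 296, 68]


def next_lower_plateau(current_mtu):
    """Return the largest plateau strictly below current_mtu."""
    for plateau in PLATEAUS:
        if plateau < current_mtu:
            return plateau
    # Already at or below the smallest plateau: stay at the IPv4 minimum.
    return PLATEAUS[-1]


def hunt_back(start_mtu, path_ok):
    """Step downwards only, until path_ok(size) reports success.

    path_ok is a hypothetical callback standing in for "a packet of
    this size reached the ETE without a runt first-fragment report".
    """
    size = start_mtu
    while not path_ok(size) and size > PLATEAUS[-1]:
        size = next_lower_plateau(size)
    return size
```

The search never increases the size again, which is why the result
can undershoot the true MTU (for example 1006 for a 1300 byte path).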

> > In SEAL, every data
> > packet is an implicit probe, and the ETE uses IPv4
> > fragmentation as an indication that it needs to tell
> > the ITE to reduce the size of the packets it is sending.
>
> To function correctly in IPv4, SEAL also relies on the first fragment
> arriving at the ETE, and that fragment being the length of the
> shortest MTU between the ITE and the ETE.  There's not much chance of
> the first fragment being lost, so this is fine.
>
> As far as I know, IPv4 routers are not required to make their
> fragments (all but the last usually) the length of their MTU limit,
> but I guess most of them do.
>
> Assuming they all do, your IPv4 fragmentation approach has an
> advantage where there are two or more MTU limits in succession.
>
> If the first limit is 1400 and the second 1300, then by relying on
> PTBs, as you do with IPv6, then first of all, the ITE discovers the
> 1400 limit, and sends a PTB to the sending host (SH) with the correct
> value for that: 1400 - 24 (IPv4 header + SEAL header) = 1376.  The SH
> tries again, and the encapsulated packet generates a PTB at the
> router with the 1300 limit.  This is propagated back to the SH in the
> same way, with a PTB with a 1276 byte MTU - and then all is well.
>
> With your fragmentation approach, the ETE should get a first fragment
> of length 1300 bytes.  It reports this back to the ITE, and it sends
> a single PTB to the SH, with a value of 1276 bytes.
>
> Which would be faster depends on the circumstances - since the PTB
> back to the ITE will often be faster than the longer path all the way
> to the ETE.
>
> If the limiting router sends a fragment of length less than the
> limiting MTU, then the SEAL ITE would adopt an unrealistically low
> MTU value.  I guess this is unlikely.
>
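
The 1400/1300 example above can be worked through with a small
Python sketch. It assumes the 24 byte overhead quoted above (IPv4
header + SEAL header) and, as in the text, that a fragmenting router
emits a first fragment exactly as long as its next-hop MTU (8 byte
fragment alignment is ignored). The function names are hypothetical.

```python
ENCAP_OVERHEAD = 24  # IPv4 header + SEAL header, the figure used above


def ptb_driven(inner_len, link_mtus, overhead=ENCAP_OVERHEAD):
    """MTU values sent to the SH when each limiting router returns a PTB.

    Each round, the first link too small for the encapsulated packet
    produces one PTB, and the SH retries with the advertised size.
    """
    advertised = []
    while True:
        outer_len = inner_len + overhead
        limit = next((m for m in link_mtus if m < outer_len), None)
        if limit is None:
            return advertised
        inner_len = limit - overhead
        advertised.append(inner_len)


def fragment_driven(inner_len, link_mtus, overhead=ENCAP_OVERHEAD):
    """MTU values sent to the SH when the ETE reports the first-fragment
    length, assumed here to equal the smallest link MTU on the path."""
    outer_len = inner_len + overhead
    smallest = min(link_mtus)
    return [smallest - overhead] if smallest < outer_len else []


print(ptb_driven(1476, [1500, 1400, 1300]))       # two PTBs to the SH
print(fragment_driven(1476, [1500, 1400, 1300]))  # one PTB to the SH
```

Under these assumptions the PTB-driven path takes two rounds (1376,
then 1276) while the fragment report yields 1276 in one round; which
is faster in wall-clock terms still depends on the path lengths.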
>
> IPTM doesn't hunt back and forth.  When a long enough packet arrives
> at the ITR for the encapsulated length to fall between its two
> current markers for the Zone of Uncertainty (low and high water
> markers, I think, are the terms used in other protocols) the dual
> packet probe technique will usually result in reliable new
> information in most instances, due to one of:
>
>   1 - The traffic packet is delivered to the ETR and the ITR is
>       informed about this - so the ITR raises the low marker.
>
>   2 - The traffic packet was not delivered and the ITR gets a PTB.
>       This enables the ITR to lower its high marker and send a
>       PTB to the SH with the correct MTU value, so that the
>       subsequent packets the SH sends, once encapsulated, will fit
>       the MTU limit exactly.
>
>   3 - If the long packet does not arrive at the ETR, but the short
>       one does, and the ITR receives no PTB, then perhaps the long
>       packet was merely lost.  However, if this happens repeatedly
>       then the ITR should discern that there is an MTU limit below
>       this length, and so adjust its upper marker downwards.  This
>       will result in a PTB going to the SH, which will send
>       shorter packets.  Over multiple iterations, the ITR will
>       discover the MTU.  If the steps downwards taken by the ITR
>       in reducing its upper marker are small, this will take a long
>       time.  If they are larger, the true MTU will be found faster,
>       but there may be more overshoot, resulting in somewhat
>       smaller MTU for all hosts whose packets are tunneled to this
>       ETR than is really needed.
>
>       Once a pair of probe packets does arrive at the ETR, the ITR
>       is informed of this by the ETR and it raises its upper marker
>       accordingly.
>
>  Pretty quickly, the two markers would reach the same value and the
> ITR would have a reliable measure of the MTU to this ETR.
>
> The most likely pattern would be this:  Initially (before the ITR has
> ever sent packets to this ETR address) the markers are wide apart.
> Then the SH sends a packet which is longer, once encapsulated, than the
> low marker and also longer than the actual tunnel MTU, minus
> encapsulation overhead.  So the ITR does the two packet probing
> protocol and gets a PTB.  Assuming this is from the router which sets
> the lowest PMTU limit in the tunnel, then this PTB enables the ITR to
> send a PTB straight back to the SH.  This also enables the ITR to
> lower the upper marker to match the MTU reported in the PTB it just
> received.
>
> The SH will create another packet of the correct length to suit the
> tunnel (after encapsulation it will be the length specified in the
> PTB from the limiting router).  The ITR gets this and because it
> would be longer, after encapsulation, than the low marker, the ITR
> again sends it to the ETR with the two-packet tunneling protocol.
> This time, it gets through, and the ETR reports this.  As soon as
> this report gets to the ITR, the ITR adjusts its lower marker to the
> same value as the upper marker - so there is no more Zone of
> Uncertainty.
>
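
The marker bookkeeping described above can be sketched as follows.
This is only an illustration of the low/high marker idea, not the
IPTM design itself: the class name, the method names, the initial
marker values and the step size are all assumptions, and lengths are
encapsulated packet lengths.

```python
class ZoneOfUncertainty:
    """Per-ETR PMTU estimate kept as a low and a high marker.

    Packets no longer than `low` are believed to fit; packets longer
    than `high` are believed not to fit; anything in between is probed.
    """

    def __init__(self, low=68, high=65535):  # initial values are assumptions
        self.low = low
        self.high = high

    def needs_probe(self, length):
        return self.low < length <= self.high

    def too_long(self, length):
        return length > self.high

    def on_delivered(self, length):
        # The ETR confirmed delivery: raise the low marker.
        self.low = max(self.low, length)

    def on_ptb(self, mtu):
        # A router reported its next-hop MTU: lower the high marker.
        self.high = min(self.high, mtu)

    def on_repeated_loss(self, step=16):
        # No PTB, but long packets keep vanishing: step the high marker down.
        self.high = max(self.low, self.high - step)

    def converged(self):
        return self.low == self.high


zone = ZoneOfUncertainty()
zone.on_ptb(1400)        # first long packet hit a 1400 byte limit
zone.on_delivered(1400)  # the SH's retry, 1400 bytes once encapsulated, got through
print(zone.converged(), zone.too_long(1450))
```

Once the markers meet, a longer packet from any sending host is
rejected with a PTB without being sent into the tunnel.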
> The SH will generally continue to send packets of this length or
> shorter.  If some other SH (or another application in the same SH)
> sends a packet which needs to be tunneled to the same ETR and which
> is longer than this now reliably known PMTU value, then the ITR will
> drop it and send back a PTB, without trying to send it into the
> tunnel.  That application or SH will then send packets of the right
> length.
>
> I haven't figured out every detail of this - at some stage I intend
> to work on it more and write it up as an ID.  For now, it is at:
>
>    http://www.firstpr.com.au/ip/ivip/pmtud-frag/
>
>
>
> >> In Ivip, most traffic packets are encapsulated by the ITR with the
> >> sending host's address as the outer header's source address.  Any PTB
> >> which results from those goes to the sending host, which will not
> >> recognise it.
> >
> > If the source address of the original packet also
> > goes as the source address of the outer packet, then
> > wouldn't that constitute mixing both EID and RLOC
> > addressing within the same routing region? I thought
> > the whole purpose of the CES approach was to keep the
> > EID and RLOC routing and addressing spaces separate.
>
> The "Separation" means that a subset of the global unicast address
> space is used as "SPI" space (Ivip) or "EID" space (LISP).  This is
> not a separate namespace, just a subset of the global unicast address
> space, in the form of multiple DFZ-advertised prefixes which have no
> name in LISP, but which are called Mapped Address Blocks (MABs) in Ivip.
>
> The ITR tunneling uses the sending host's source address as the outer
> source address for all ordinarily tunneled packets.  This enables any
> ISP BR filtering - which drops incoming packets due to them having a
> source address from any one of the ISP's prefixes - to be enforced on
> the inner packet by all ETRs, by the simple method of dropping any
> inner packet whose source address is different from the outer
> header's source address.
>
> This functionality is also enforced with the 2 packet probing
> arrangement - the short A packet is also sent with its outer source
> address being that of the sending host.
>
> When a packet would be, after encapsulation, a length within the
> "Zone of Uncertainty" then the ITR uses the special long (B) and
> short (A) protocol.  The B packet's outer source address is that of
> the ITR - so the ITR would normally get any PTB which arises from the
> B packet.
>
> The "sending host" address could be a conventional (non-SPI) address
> such as from a host on PA or PI space - or it could be on an SPI
> address.  This is the source address of the packet the ITR is
> processing.  Whether that is the actual address of the sending host,
> or just that of a NAT box which the sending host is behind, is not
> known to the ITR.
>
> All hosts and most routers make no distinction between addresses in
> the SPI subset and the rest of the addresses in the global unicast
> address range.  Only ITRs treat them differently - and then only when
> they appear in the destination address of a packet which is forwarded
> to them.  Instead of forwarding the packet according to its
> destination address, the ITR's FIB processes the packet differently.
>
> If the ITR's FIB (which may be all in software, since the ITR may be
> in the sending host or be implemented on a COTS server) already has
> mapping for a micronet which matches this SPI address, it uses that
> mapping (a single ETR address) to tunnel the packet.  If not, it
> buffers the packet, requests mapping from a nearby QSD (full database
> query server), installs the mapping (a micronet start and end
> address, with a single ETR address) in its FIB and then tunnels the
> packet accordingly.
>
> There's no concept of "routing region" in Ivip.  ITRs in any place
> would do exactly the same thing.  All other devices - hosts and
> ordinary routers - make no distinction between SPI addresses and the
> remainder of the addresses in the global unicast address range.
>
>
>
> >> In this scenario, the ITR gets back just the IPv4 header and the UDP
> >> header.  The attacker has to guess the 16 bit ID field in the IPv4
> >> header, which is tricky - but it could eventually succeed in doing
> >> so.  Here are the components of the UDP header:
> >>
> >>   Source port     The ITR could use a randomized source port.  This,
> >>                   combined with the 16 bit ID field, could extend
> >>                   the number of bits to be guessed to 32 - which
> >>                   I think is sufficiently secure, considering a
> >>                   successful attack only degrades efficiency, rather
> >>                   than causes actual loss of connectivity.
> >>
> >>   Destination port   Currently, I assume there is a single UDP port
> >>                   on all ETRs to send the long (B) packet to.
> >>                   I could easily randomize this too - such as making
> >>                   the most significant 8 bits fixed, and the others
> >>                   up to the ITR to choose.
> >>
> >>                   This would be 40 random bits - perfectly secure
> >>                   considering the moderate level of DoS the attack
> >>                   could result in.
> >>
> >>   Length          If the attacker created the traffic packet, they
> >>                   would know the length of what follows the UDP
> >>                   header.
> >>
> >>   Checksum        Ahhh - this is not a header checksum.  This covers
> >>                   the data behind the UDP header.  This data is
> >>                   mainly from the traffic packet, but it contains
> >>                   a nonce.  So the 16 bit checksum is affected by
> >>                   the nonce.
> >>
> >> I hadn't realised this before - the UDP checksum contains another
> >> 16 bits the attacker has to guess.  Combined with the IPv4 header's
> >> 16 bit ID field, I think this makes it highly secure.  If this is
> >> not enough, the 16 bit random ITR source UDP port should be sufficient.
> >>
> >> So the ITR doesn't need any more bits than are necessarily supplied
> >> by a minimally compliant RFC1191 implementation in the router which
> >> sends the PTB.
> >>
> >> How would this work for SEAL?
> >
> > Using the UDP/TCP checksum as a nonce requires that the
> > ITE cache copies of its recently-sent packets.
>
> The above procedure is only for the long (B) packet when the ITR is
> still uncertain of the PMTU to a given ETR and the packet, if
> normally encapsulated, would be of a length within this Zone of
> Uncertainty.  The B packet is the same length, but is sent with the
> ITR's address in the outer header.  Like the short A packet, it is a
> UDP packet with special headers.  All the normally encapsulated
> packets are IP-in-IP, with no other headers, and with the sending
> host's address in the outer source address.  So normally encapsulated
> packets can't generate PTBs to the ITR - only to the sending host,
> which would not recognise them, since they arise from an encapsulated
> packet.
>
> So only for those traffic packets which the ITR is using to
> generate the B and A probe pair does it need to cache enough of the
> initial packet to be able to generate a valid PTB to the sending
> host.  The ITR would also cache the nonce, which it uses to secure
> the ETR's response to these packets.
>
> It would also cache the UDP header of the B packet, which includes
> the checksum which is almost impossible for an attacker to guess due
> to its dependence on the nonce.  I think the combination of an
> unpredictable 16 bit ID in the outer header of the B packet, and the
> influence of the nonce on the 16 bit checksum, would be sufficient to
> prevent attacks succeeding at a significant rate.  If that wasn't
> enough, the ITR could use 16 bit randomization of its UDP source
> port, and randomize 8 or more bits of the destination UDP port too.
>
>
> > But then,
> > it would need to do this for every tunnel it belongs to
> > and it has no way of knowing for how long it will have
> > to retain the cached copies.
>
> As I noted above, the ITR doesn't cache its ordinarily encapsulated
> packets.  The PTB would not go to the ITR.  It only needs to cache
> the start of those packets it is sending with the two-packet probing
> protocol.
>
> If the ITR estimates the PMTU to a given ETR, and gets it right - and
> then later the MTU falls, then there is a difficulty.  The normally
> encapsulated packets which are too long will be dropped, the ITR will
> not get any PTBs and the sending host will not recognise the PTBs
> which are sent to it.
>
> I can think of two approaches to minimising the impact of this.
>
> One is to have the ITR periodically, such as every 30 seconds, send a
> packet which (once encapsulated) is at, or close to, the MTU limit,
> as a B and A probe pair.  This is assuming the ITR is continually
> sending long packets to this ETR.
>
> This will normally deliver the packet fine, and the ITR will be able
> to confirm that the PMTU has not become any less than what it
> assumes.  If the packet doesn't arrive, or if the ITR gets a PTB,
> then perhaps the MTU has dropped and the ITR can find out what it has
> dropped to.
>
> It will usually find out the new PMTU from a PTB, but if there is no
> PTB, then it will need to send more probes of various lengths until
> one size does get through to the ETR.
>
> It would do this as described above, without having to generate
> special probe packets, by lowering its upper marker for the MTU
> estimate by some value, such as 8 or 16 bytes, sending a PTB to the
> sending host (or multiple sending hosts as they send packets which
> would exceed this length, once encapsulated) and allowing the sending
> hosts to create traffic packets which the ITR will send using the 2
> packet probing technique.  This will rapidly reduce the value of the
> upper marker to being equal to, or somewhat less than, the real PMTU
> limit.  This will drag down the lower marker too.
>
>
> The following second approach would only work if the returned part of
> the packet was long enough to show the SH that it really did result
> from a packet the SH sent.  In practice, this should always be the
> case with IPv6, and I have gained the impression that it is
> common for IPv4 routers to send back more than the bare RFC1191
> minimum anyway.  This would require the sending host to have a
> modified stack which was ready to analyse PTBs which resulted from
> packets in the ITR to ETR tunnel.  Assuming this modified stack code
> could verify that the PTB was genuine, it would compute a new MTU for
> this destination address, by subtracting the encapsulation overhead
> (20 for IPv4, 40 for IPv6) from the MTU value in the PTB.
>
> This one PTB would not help the SH learn about a reduced PMTU to
> other SPI destination addresses it was sending to which also were
> being tunneled to the same ETR.  But the same code would receive PTBs
> from those packets too.  The ITR would be none-the-wiser, so the
> first technique and the one below would still be important - but this
> optional host upgrade would enable the SH to respond immediately and
> correctly to a reduction in PMTU to an ETR.  Other applications in
> other SHs would need to repeat this exercise, since the ITR doesn't
> know these PTBs are occurring.
>
> After 10 minutes, a sending host is allowed (RFC1191 / 1981) to try
> sending a longer packet than was allowed by a previous PTB.  Then,
> the ITR needs to recognise the time which has elapsed and use this
> with the B and A probe technique.
>
> This may be a little complex, considering multiple applications in
> one sending host, multiple hosts and multiple destination SPI
> addresses may all result in packets being sent to the one ETR address.
>
> I can see ways of coping with this stuff - but some of it requires
> carefully designed algorithms and considerable logic and state in the
> ITR.
>
> If we can upgrade the DFZ and other routers with firmware, then all
> this encapsulation and PMTUD stuff can be ignored - by using Modified
> Header Forwarding instead.  That is the way to do it in the long-term
> future, even if we start with encapsulation.
>
>
> > With SEAL, the ITE never
> > has to cache packets in order to match them up with
> > any PTB feedback.
>
> Yes - here's my understanding of how your SEAL ID specifies how the
> ITE and therefore the SH perform PMTUD in the tunnel path, between
> the ITE and the ETE.
>
> I assume in all cases that the ITE initially sets S_MRU for this ETE
> to "infinity" as described in 4.3.3, and then uses one of the
> following methods to reduce it.  This is me trying to imagine how the
> SEAL would be used for ITE to ETE tunneling when RANGER is used as a
> Core-Edge Separation architecture.
>
> RANGER can be used for many more purposes than this, and so can SEAL,
> so it is quite a challenge to decide which parts of the IDs to
> ignore.  I understand that in this application, the ITE will be
> reducing the S_MRU value for each ETE it tunnels to.  I think that in
> other SEAL applications, this may not occur, and so you have
> arrangements for using SEAL segmentation to send long traffic packets
> as multiple SEAL-segmented packets to the other end of the tunnel.
> But this would never be invoked in a Core-Edge Separation
> application, since the ITE always sends PTBs to the SHs to have them
> reduce their packet length.
>
> In this application of SEAL, I understand there is no need for any
> mid-layer protocol between the IPv4 or IPv6 header and the SEAL
> header, or between the SEAL header and the traffic packet.  This is
> not clearly specified anywhere, since the SEAL and RANGER documents
> are general purpose, and their use for a scalable routing solution as
> a Core-Edge Separation architecture is only one thing they could be
> used for.

In some environments, it may be necessary to insert a
mid-layer UDP header in order to give ECMP/LAG routers
a handle to support multipath traffic flow separation.

> Firstly I describe my understanding of what your ID specifies for IPv4.
>
> Secondly I describe two other ways you might do PMTUD with IPv4,
> without using DF=0 packets.  These would avoid whatever risk there
> might be of setting the ITE's PMTU estimate too low due to a limiting
> router sending out fragments which are shorter than the limiting next
> hop MTU.
>
> Finally, I describe my understanding of what your ID specifies for IPv6.
>
> This is partly for my own reference, since it took me many hours to
> discern this by reading the SEAL ID and corresponding with you.
>
>
> IPv4:
>
>   The ITE sends a DF=0 packet into the tunnel.  This starts with
>   an IPv4 header, then has a SEAL header (there's no mid-level
>   protocol in this Core-Edge Separation usage of RANGER and SEAL)
>   and then the inner packet, the original IPv4 traffic packet.
>
>   The source address in the outer header is that of the ITE and
>   the destination address is that of the ETE.  The 32 bit
>   SEAL ID is split in two.  16 bits go into the IPv4 header's
>   ID field and 16 into the SEAL header's ID Extension field.
>
>   The limiting router in the tunnel (the one where the next-hop
>   MTU is less than the length of this whole packet) fragments
>   it into at least two fragments.
>
>   Now the second para in 4.4.2 comes into play:
>
>         When the ETE processes the IP first-fragment (i.e.,
>         one with MF=1 and Offset=0 in the IP header) of a
>         fragmented SEAL packet, ...
>
>   The first para was for reassembling packets which had been
>   fragmented by the SEAL protocol.  But the second para is for
>   SEAL packets, as was just sent, being fragmented by a
>   router between the ITE and the ETE.  This only occurs for
>   IPv4 and I think it would be helpful to mention IPv4 in this
>   paragraph.  Maybe it needs its own section.
>
>        ...  it sends a "Reassembly Report - Fragmentation
>        Experienced" message back to the ITE with the S_MSS field
>        set to the length of the first-fragment and with the
>        S_MRU field set to no more than the size of the reassembly
>        buffer (see Section 4.4.5).
>
>   I think this last part about the value of S_MRU is not clear
>   enough.  What value should it be set to?
>
>   I will assume it is set to some non-zero value.
>
>   Assuming the limiting router sent out the first fragment with
>   a length equal to the limiting next-hop MTU, then this MTU
>   value is now in the S_MSS field of the message sent to the
>   ITE.
>
>
>   This message arrives at the ITE.  This message, according to
>   Figure 4, contains:
>
>        As much of invoking packet as possible without the
>        message exceeding 576 bytes.
>
>   Maybe your ID specifies this, but I am having trouble
>   following it - there has to be a way the ITE securely
>   accepts this "Fragmentation Experienced" message.
>
>   As far as I know, the ITE looks into the message, finding
>   the initial part of the packet which the ETE received as
>   a first fragment.  That will contain the outer IPv4 header
>   and the SEAL header, and from these the 32 bit SEAL ID
>   in the SEAL encapsulated packet can be found.
>
>   I think you either cache the recently sent 32 bit SEAL IDs
>   or maintain a sliding window function over their range so
>   you can easily identify a value which was used in the last
>   second or two.  In a given ITE, each ETE has its own SEAL
>   ID counter.  Its value is initialized randomly when the
>   state for this ETE is created.  After that, its value
>   increments with each packet sent to the ETE.
>
>   (I may adopt this incrementing value per ETR arrangement,
>   with its sliding window, rather than using a nonce.)
>
>   The wider the window in time, the longer you can accept these
>   messages.  Since the ETE and the ITE could be on opposite
>   sides of the Net, I guess you need to have a window which
>   accepts SEAL IDs sent at least a second ago.
>
>   The longer the window in time, and the more packets the
>   ITE sends to this ETE, the wider the window is numerically
>   and the easier it is for an attacker to guess a valid value
>   and have the ITE accept a PTB with a low enough value
>   to cause lost efficiency - for the next 10 minutes or so.

The above is all correct wrt the window management. The
ITE can ensure that the window size remains bounded by
sending periodic explicit probes (e.g., one explicit
probe per N data packets).

>   Now to 4.3.9.1.2:
>
>        4.3.9.1.2. Fragmentation Experienced (Code=1)
>
>        If the value in the S_MRU field is non-zero, the
>        ITE records the value in its soft state for this ETE.
>
>   This means this value is stored in the S_MRU variable for
>   this ETE, as defined in 4.3.3.  As noted above, I am not
>   clear on what value was written into this field of the
>   report by the ETE.
>
>        The ITE then adjusts the S_MSS value in its soft state.
>
>   This means this value is stored in the S_MSS variable for
>   this ETE, as defined in 4.3.3, subject to the instructions
>   in the next few sentences.
>
>   I am a bit confused about the differing roles of these two
>   variables.
>
>        If the S_MSS value in the Reassembly Report is greater
>        than 576 (i.e., the nominal minimum MTU for IPv4 links),
>        the ITE records this new value in its soft state.
>
>   OK - this is based on the assumption that the length of the
>   first fragment received by the ETE reflects the limiting
>   MTU of the ITE to ETE path.
>
>        If the S_MSS value in the report is less than the current
>        soft state value and also less than 576,
>
>   How could the ITE's S_MSS value for this ETE be less than
>   576?  I can't see how.  If it can't be, then the first part
>   of the above sentence may be redundant.

No, the sentence is correct. It is possible for the ITE
to need to reduce its cached S_MSS value to a size less
than 576 if there is truly a link with a small MTU (e.g.,
256) on the path. Although 576 is often considered to
be the "nominal" minimum MTU for IPv4 links, the actual
minimum MTU is only 68 bytes per RFC791.
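
To make the two quoted conditions concrete, here is a small Python
sketch of one reading of them. It covers only the two sentences
quoted above and leaves S_MSS unchanged in any case they do not
mention. The function name and the needs_search flag are
hypothetical and do not come from the draft.

```python
NOMINAL_MIN_MTU = 576  # "nominal" minimum MTU for IPv4 links, per the quoted text


def handle_reported_s_mss(current_s_mss, reported_s_mss):
    """Return (new_s_mss, needs_search) for one Reassembly Report.

    needs_search is True when the report shows fragmentation but the
    true MTU is unknown, so an iterative search would have to follow.
    """
    if reported_s_mss > NOMINAL_MIN_MTU:
        # The reported value is taken as the new soft-state value.
        return reported_s_mss, False
    if reported_s_mss < current_s_mss and reported_s_mss < NOMINAL_MIN_MTU:
        # Possibly a runt first-fragment: do not trust the size itself.
        return current_s_mss, True
    # Case not covered by the quoted sentences: leave the state alone.
    return current_s_mss, False
```

Note that current_s_mss may itself already be below 576 after an
earlier search, which is why both comparisons are needed.
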

>        the ITE can discern that IP fragmentation is occurring
>        but it cannot determine the true MTU of the restricting
>        link due to a router on the path generating runt
>        first-fragments.
>
>   Then the next paragraph describes the "iterative searching
>   strategy" to find the correct (or near enough, but perhaps
>   lower) value for the S_MSS variable for this ETE.
>
>   I think this paragraph is unclear.  I think it should state
>   that the probes are occurring only due to traffic packets
>   arriving at the ITE and being tunneled to this ETE, and
>   these being long enough.  Since SEAL treats all packets
>   as probes, this use of the term "probe" may be confusing -
>   since in fact all packets may be probes.
>
>   I think this paragraph should describe the process in detail
>   - I guess it only occurs with real traffic packets.
>
>   The reference to section 5 of RFC1191 is not very helpful
>   because it describes several algorithms.
>
>
>
> IPv4 - my suggestion for doing it with DF=1 packets
>
>   A - If you can be sure the routers send back more than
>       the bare minimum IPv4 header + 32 bits.  (It's just
>       firmware updates to have routers do this and maybe
>       most or all of them already do.)
>
>       Send the SEAL packets as noted above, but with DF=1.
>
>       If there is an MTU problem, the ITE will get a PTB
>       with the MTU value it needs, plus enough of the
>       SEAL packet to extract sufficient of the traffic
>       packet to make a valid PTB for the SH.
>
>       The MTU value from the received PTB is written into
>       the S_MTU for this ETE.  This gives an exact value
>       without the problem of potential "runt packets" which
>       arises with DF=0 in your current process.
>
>       The ITR subtracts 24 from the MTU value it received
>       in the PTB from the limiting router (20 bytes of IPv4
>       header + 4 bytes of SEAL header) and uses this value
>       in the PTB to the SH.
>
>       This will work fine - the SH will then send packets
>       of the correct size, so when they are SEAL encapsulated
>       they will not have a problem with this MTU limit.
>
>       If any other SH, or another application in the same
>       SH, sends a packet whose length exceeds the new MTU
>       value minus 24, then the ITE will send back a PTB
>       accordingly.
>
>       If there is a further, lower, MTU limit en-route to
>       the ETE, then the above process will be repeated.
>       This is similar in principle to your IPv6 approach.
>
>
>   B - If you have to assume that some or all routers between
>       the ITE and the ETE only send back the bare minimum
>       amount of packet in their PTB, then you can still
>       accept these packets securely, and calculate a proper
>       MTU value to send to the SH, as described above.
>
>       In order to be able to generate a valid PTB, you need
>       the ITE to have cached the IPv4 header and the next
>       32 bits which follow (24 bytes) for each packet sent
>       which you think might give rise to a PTB.  You don't
>       need to do this with packets shorter than some constant
>       - depending on whatever is the lowest PMTU you ever
>       expect to find between an ITE and an ETE.
>
>       I guess you only need to cache these 24 byte items for a
>       second or so.  You need to be able to index into the
>       cache by using the 32 bit SEAL ID retrieved from the
>       16 bit IPv4 ID and the 16 bit SEAL ID Extension in the
>       initial part of the encapsulated packet, which is in
>       the first fragment, as returned in the PTB.
>
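
The caching idea in suggestion B could be sketched roughly as
follows in Python. This is only an illustration. The class and
function names are hypothetical, the one second lifetime is the
guess from the text, and placing the ID Extension in the high-order
16 bits of the SEAL ID is an assumption made here, not something
stated above.

```python
import time


def seal_id_from_ptb(ipv4_id, id_extension):
    """Combine the two 16-bit fields into a 32-bit SEAL ID.

    Assumption: the ID Extension forms the high-order half.
    """
    return ((id_extension & 0xFFFF) << 16) | (ipv4_id & 0xFFFF)


class HeaderCache:
    """Short-lived cache of the first 24 bytes of sent packets."""

    def __init__(self, lifetime=1.0, clock=time.monotonic):
        self.lifetime = lifetime  # seconds an entry stays usable
        self.clock = clock        # injectable so tests can control time
        self.entries = {}         # seal_id -> (expiry_time, first_24_bytes)

    def store(self, seal_id, packet):
        """Remember the leading 24 bytes of a packet under its SEAL ID."""
        expiry = self.clock() + self.lifetime
        self.entries[seal_id] = (expiry, bytes(packet[:24]))

    def lookup(self, seal_id):
        """Return the cached bytes once, or None if absent or expired."""
        entry = self.entries.pop(seal_id, None)
        if entry is None or entry[0] < self.clock():
            return None
        return entry[1]
```

Entries that are never looked up stay in the dictionary, so a real
implementation would also need a periodic purge.
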
>
> IPv6:
>
>   The ITE creates an IPv6 header and a Fragment Header.  As far
>   as I know, there is no SEAL header.  I think this should be
>   made more clear in the final part of 3.4.3.
>
>   The 32 bit SEAL ID is written into the Identification field
>   of the Fragment Header.  The ITE then appends the traffic
>   packet.
>
>   The result is forwarded towards the ETE.  If it is too big for
>   a next-hop MTU in any router en-route to the ETE, that router
>   sends back a PTB to the ITE with an MTU value, and enough of
>   the original packet for the ITE to construct a valid PTB to the
>   SH.
>
>      (I can't find where your ID describes the reception of
>       the PTB.  Section 4.3.8 should cover this, but makes no
>       specific mention of PTB messages.)
>
>   The ITE secures the acceptance of the PTB by comparing
>   the 32 bit SEAL ID, as noted for IPv4 above, against a cached
>   set of recently used values or some kind of window function.
>
>   The MTU value is written to the S_MTU for this ETE.
>
>   The ITE subtracts 48 from the MTU value (40 bytes for the IPv6
>   header and 8 bytes for the Fragment Header) and uses this to set
>   the MTU value in the PTB which is sent to the SH.
>
>   As with the IPv4 approach, any packets arriving at the ITE
>   which will be tunneled to this ETE, if longer than the MTU
>   value minus 48, will be dropped and used to send a PTB to
>   the SH.
>
>   This is identical in principle to my IPv4 suggestion A above.
>
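
The MTU arithmetic in the IPv4 suggestion A and the IPv6 description
above can be written out as a short Python sketch. The function
names are hypothetical, and the IPv4 figure assumes an outer header
with no options.

```python
# Encapsulation overhead in bytes, as given in the text:
#   IPv4: 20 (IPv4 header, no options) + 4 (SEAL header) = 24
#   IPv6: 40 (IPv6 header) + 8 (Fragment Header)         = 48
ENCAP_OVERHEAD = {"ipv4": 20 + 4, "ipv6": 40 + 8}


def inner_mtu(outer_mtu, family):
    """MTU to report to the sending host, given the MTU from the PTB."""
    return outer_mtu - ENCAP_OVERHEAD[family]


def needs_ptb(packet_len, outer_mtu, family):
    """True if a traffic packet is too long to encapsulate and send."""
    return packet_len > inner_mtu(outer_mtu, family)
```

For example, a router reporting an MTU of 1500 leads to a PTB with
MTU 1476 for IPv4 and 1452 for IPv6 being sent to the SH.
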
>
> Returning to Ivip's IPTM protocol:
>
> With IPv6, I could avoid caching any part of the packet if I could
> rely on the ITR getting a PTB, since the PTB is guaranteed to contain
> plenty of the inner packet - enough for the sending host to
> recognise.  (Actually, the minimum amount of original packet returned
> could be less than what should be returned to the SH, due to the
> encapsulation IP, UDP and IPTM headers which precede it.  Still,
> enough should be there that any SH should be able to recognise it.)
>
> I still need to send two packets, since the long B packet does not
> contain the full traffic packet.  It has extra things - a UDP header
> and an IPTM header, with a 32 bit nonce.   The last part of the
> traffic packet is not in the B packet.
>
> The part which doesn't fit is contained in the A packet.
>
> If both A and B arrive at the ETR, the whole traffic packet is
> delivered.  The A packet is matched to its B packet with the nonce
> they both contain in their IPTM headers.
>
> The ETR will send a message to the ITR, also secured by the nonce,
> to tell it that both parts arrived.
>
> If the B part doesn't arrive, after a ~0.5 second time-out, the ETR
> will send a message to the ITR telling it that only the A part arrived.
>
> If only the B part arrives, the ETR sends a message to the ITR to
> that effect too.
>
> With IPv4, if it could be assumed that all routers would return
> sufficient of the packet to include the first 24 bytes of the inner
> packet, then I could use the same approach as for IPv6 - and so avoid
> caching any part of the traffic packet in the ITR.
>
> If I couldn't assume this, then there are two approaches:
>
>   1 - Do as for my suggestion B above - have the ITR cache the
>       first 24 bytes of traffic packets used for this 2 packet
>       probing technique.  The cache would be indexable via the
>       nonce sent with each packet - and the caching time would
>       be about a second.
>
>   2 - Avoid caching in the ITR, by including the first 24 bytes
>       of the traffic packet in the A packet.  If the A packet
>       arrives at the ETR, and the B packet doesn't, the ETR
>       can report this, and return the 24 bytes from the A
>       packet.
>
> The ITR is only doing this 2 packet probing technique infrequently,
> so the caching approach is not particularly expensive.  Caching the
> first 24 bytes of the packet has an advantage that when the ITR gets
> a PTB - as would normally be the case if the B packet was too long -
> then the ITR can send the PTB immediately, rather than waiting for a
> message from the ETR, which would necessarily arrive at least a
> second after the ITR tunneled the packet.
>
>
> The detection of a PMTU limit doesn't have to be absolutely
> bullet-proof.  It should not result in the ITR deciding that the PMTU
> is lower than it actually is, but if, for some reason, the probe
> process produces an indeterminate result - such as the ITR not
> getting anything back from the ETR as a result of the A or B packet,
> and no PTB either, then the ITR takes no further action.  This is
> indistinguishable from ordinary packet loss.  The most likely outcome
> is that the SH will try again, with a similar sized packet (unless it
> is doing RFC 4821 ... which no hosts appear to be doing at present)
> and the ITR will again generate the B and A packets.  Then, the most
> likely outcome will be that the ITR learns something definite about
> the PMTU, and so reduces its Zone of Uncertainty.
>
> Doing this PMTUD stuff in the FIB of a big router handling gigabit
> and 10 gigabit links could be quite challenging.  It doesn't have to
> be done this way, since Ivip ITRs and ETRs can be implemented in
> software on ordinary servers, which are inexpensive and can still
> handle (I guess) gigabit traffic rates.  Also, having the ITR in the
> sending host is a zero cost way of ensuring each ITR doesn't have to
> juggle too many of these PMTUD probing sessions at once.
>
>
> >> At present, for IPv4 and IPv6, your ITE (ITR) functions emit packets
> >> with an outer header of IPv4 or IPv6, followed by a 32 bit SEAL header.
> >>
> >> Immediately following the SEAL header you may have some "mid-layer
> >> headers" which I don't properly understand.  Then you have the IPv4
> >> or IPv6 traffic packet, or perhaps a segment of it.
> >>
> >> You could make the SEAL ITE work fine with minimal length IPv4 PTBs
> >> if the SEAL header was extended to 64 bits, with the additional 32
> >> bits being a nonce.  That would always be returned in any PTB.
> >
> > SEAL uses the 32bit ID (gotten from the 16bit IPv4 ID
> > concatenated with SEAL's 16bit ID extension) as a nonce.
> > There is no need that I can see for including an
> > additional nonce.
>
> OK.
>
>
> >> So I think your objection to using RFC1191 PTBs should only be based
> >> on your concern about the PTBs being systematically dropped due to
> >> filtering.
> >>
> >> I assert that such filtering is a symptom of a badly administered
> >> network - and that it should be fixed in the network, not worked
> >> around with a protocol such as SEAL or IPTM.
> >
> > In my understanding, in the interdomain routing region of
> > the Internet there is no close coordination regarding the
> > way "the network" is administered. There is also a wide
> > variety of network vendor equipment deployed in the
> > Internet which may have widely varying default behaviors.
> > So, in general it seems overly optimistic to assume that
> > all of the diverse policies, implementations and operational
> > practices out there could be brought into strict uniformity.
>
> OK - but at some point we need to stop adopting band-aid measures
> like artificially limiting MSS or MTU values.  That just lets the PTB
> filtering and lousy tunnels be less noticed.   We should not be
> trying to upgrade the stacks of all hosts in the world because a few
> end-user networks filter PTBs or ISPs and perhaps end-user networks
> run tunnels which don't support the otherwise perfectly good RFC 1191
> / 1981 PMTUD techniques.
>
> We would just be heaping limitations and complications on ourselves
> in an overly-defensive, expensive and inefficient attempt to cope
> with failure of a few ISPs and end-user networks to run the Internet
> as it needs to be run.  We are paying the ISPs.  The end-user
> networks which are filtering PTBs are disrupting a subset of their
> own communications.

I still think there are problems out there. I will post
another message on this soon.

> I just think it wrong in principle to develop messy new protocols
> such as RFC 4821 to cope with these failings.

In my opinion, packetization layers are operating "at risk"
if they use packet sizes larger than 1500 but are not in
some way checking with the final destination to ensure that
the big packets are actually getting through. RFC4821 is
a method for the source to do just that without requiring
any changes on the destination. But to be sure, SEAL does
not *depend* on RFC4821 but rather *sets the stage* for
RFC4821 and/or any functional equivalents.

Thanks - Fred
fred.l.templin@boeing.com

>   - Robin