Re: [tcpm] Detect Lost Retransmit with SACK
Hi group,
I forgot to mention the actual testing scenario I used to profile all these
TCP stacks.
Basically, I used a userland TCP "forging" tool, where each frame can be individually
crafted (content, timing, loss).
My test opens a TCP session (an HTTP GET request, for simplicity's sake) with SACK
negotiated and then counts the segments being received, behaving (mostly) like a
well-behaved TCP client. However, the segments with these numbers:
200, 250, 253, 255, 257, 258, 259, 260, 265, 267
are dropped the following number of times when they are seen in the stream:
1, 1, 1, 1, 1, 2, 1, 1, 1, 1
The grace period of 200 packets is to have a decently wide open cwnd; the drop at
segment 200 also serves to check if the cwnd is larger than 50 segments when the
burst drop (250-267) occurs, and also to "prime" the SACK scoreboard (preventing
the sender from fastpathing). The burst in this case is along the time axis and,
with segment number 258 (dropped twice), also along the sequence space axis...
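To make the drop pattern concrete, here is a minimal Python sketch of the drop decision only. It is not the forging tool itself; the class name DropSchedule and the method should_drop are made up for illustration, and a "segment number" is assumed to identify the same data on every retransmission.

```python
from collections import Counter

# Segment numbers and per-segment drop counts, copied from the lists above.
DROP_SEGMENTS = [200, 250, 253, 255, 257, 258, 259, 260, 265, 267]
DROP_COUNTS = [1, 1, 1, 1, 1, 2, 1, 1, 1, 1]


class DropSchedule:
    """Hypothetical helper: decide whether an arriving segment is discarded."""

    def __init__(self, segments, counts):
        # remaining[n] = how many more times segment n will be dropped
        self.remaining = Counter(dict(zip(segments, counts)))

    def should_drop(self, segment_number):
        # Drop while this segment still has drops left, then let it through.
        if self.remaining[segment_number] > 0:
            self.remaining[segment_number] -= 1
            return True
        return False


schedule = DropSchedule(DROP_SEGMENTS, DROP_COUNTS)
# Segment 258 is dropped on first arrival and on its first retransmission,
# and is accepted the third time it is seen.
print([schedule.should_drop(258) for _ in range(3)])  # [True, True, False]
```

The second drop of segment 258 is the one that matters here: it removes a retransmitted segment, which is the case none of the stacks recovered from without an RTO.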
None of the TCP stacks I have investigated so far was able to recover without
an RTO (between 0.2 and 1 sec later). Windows 7 was particularly peculiar, as it
starts shifting the original segments after the 2nd or 3rd dropped segment;
it seems to retransmit 1/2 1 1 1 1/2 segments if a contiguous hole > 1 segment
is being announced by SACK... :) But my code still drops the one containing the
258th sequence number again, leading even Win7 to an RTO...
And, on another front, I have checked a few systems in the field (our gear
is typically run in high-speed (1/10 Gbps) LANs); I found one example where
nearly 50% of the retransmissions were followed by an RTO, and even the
less loaded systems showed a quite high number of RTOs (15-35%) after
retransmissions.
I assume at this point that only a minority of the RTOs are "legitimate" in
the sense that
*) TCP session is not running with SACK
*) Client was forcefully removed from the network (loss of connectivity)
That leaves probably between 70 and 95% of the RTO events as "burst loss"
candidates, where keeping the DUPACK detection armed during FastRetransmit
would help.
I will see to it that I get statistically more relevant data, and also
put this into context (i.e. total segments transmitted per week vs. total
retransmitted segments per week vs. retransmit timeout events per week).
(Actually, I got scared at first, when I saw that high-load system reporting
50% of all retransmissions are followed by RTOs... :) ).
Richard Scheffenegger
Field Escalation Engineer
NetApp Global Support
NetApp
+43 1 3676811 3146 Office (2143 3146 - internal)
+43 676 654 3146 Mobile
www.netapp.com
Franz-Klein-Gasse 5
1190 Wien
-----Original Message-----
From: Scheffenegger, Richard
Sent: Montag, 9. November 2009 18:27
To: Alexander Zimmermann
Cc: tcpm at ietf.org
Subject: Re: [tcpm] Detect Lost Retransmit with SACK
Hi Alexander,
Thanks for the welcome :)
I will fork another thread for the LimitedTransmit||FastRecovery / ABC interaction...
I will try to sketch up an example to demonstrate what problem I'm trying to address:
Let's assume the cwnd is already open for at least 7 segments, before the segment with sequence number 10000 is the first one to be dropped by the network.
Also, let's assume that FastRetransmit runs from the left edge of the leftmost hole
(SND.UNA) upwards, and that per ACK only a single segment is sent.
Triggering          ACK     Left    Right   Left    Right
Segment                     Edge 1  Edge 1  Edge 2  Edge 2
9000                9000
10000 (lost) *
11000 (lost)
12000 (lost)
13000 (lost)
14000 (lost)
15000               10000   15000   16000
16000               10000   15000   17000
17000               10000   15000   18000
3 ACKs trigger fast retransmit
10000 (lost again)
11000               10000   11000   12000   15000   18000
12000               10000   11000   13000   15000   18000
13000               10000   11000   14000   15000   18000
-> Here we again have 3 ACKs, indicating another loss of one of the retransmitted
packets. The leftmost hole did not change, while the overall number of octets in
holes decreased for 3 consecutive ACKs (4, 3 and 2 segments still missing).
Current behaviour of investigated TCP Stacks:
14000               10000   11000   18000
(normal transmit resumes)
18000               10000   11000   19000
19000               10000   11000   20000
20000               10000   11000   21000
21000               10000   11000   22000
22000               10000   11000   23000
::                  ::      ::      ::
Eventually the RTO fires and the lost segment is retransmitted, one RTO later,
followed by slow-start...
50000               10000   11000   50000
::
::
10000               50000
However, this can be somewhere between 0.2 and 1.0 sec later with a "fresh" TCP
session (no prior connection properties known (cached) by the sender). Most likely,
the cwnd has filled up way sooner (as demonstrated, the problem seems to be
most prominent in high-speed LANs), so that for nearly as long, no data is actually
transmitted.
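As a rough back-of-the-envelope check of what such a stall costs, the small Python sketch below computes the capacity a link could have carried while the sender sits idle. It assumes the sender is idle for the whole RTO and that the link could otherwise be filled, so it is an upper bound; the function name idle_capacity_bytes is made up for illustration.

```python
def idle_capacity_bytes(link_bits_per_second, stall_seconds):
    """Bytes the link could have carried while the sender was stalled."""
    return link_bits_per_second * stall_seconds / 8


# Link speeds and RTO values taken from the ranges mentioned in this thread.
for gbps in (1, 10):
    for rto in (0.2, 1.0):
        mb = idle_capacity_bytes(gbps * 1e9, rto) / 1e6
        print(f"{gbps:>2} Gbit/s, {rto} s stall: {mb:.0f} MB")
```

For 10 Gbit/s and a 1 sec RTO this gives 1250 MB, in line with the figure of about 1.2 GB quoted further down in the thread.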
Proposed behaviour:
Triggering          ACK     Left    Right   Left    Right
Segment                     Edge 1  Edge 1  Edge 2  Edge 2
9000                9000
10000 (lost) *
11000 (lost)
12000 (lost)
13000 (lost)
14000 (lost)
15000               10000   15000   16000
16000               10000   15000   17000
17000               10000   15000   18000
3 ACKs trigger fast retransmit
10000 (lost again) *
11000               10000   11000   12000   15000   18000
12000               10000   11000   13000   15000   18000
13000               10000   11000   14000   15000   18000
Once the ACK + SACK options indicate that the leftmost hole is not shrinking,
while the SACKed octets are increasing (to deal with clients which send one
retransmission segment and one new segment interleaved, or when multiple holes
exist which are being filled, or when network reordering occurs, or when some
more segments get lost again):
Reset the Rexmit vector to the beginning of the hole list (SND.UNA), clear
the duplicate counter (just in case one segment gets lost again during
retransmit), and keep the DUPACK detection logic armed...
Also, this reaction should not occur before 1 RTT has passed, so that the ACKs
following the three which indicated the "lost again" segment will ensure (in the
typical case) that no segments are retransmitted needlessly. ACK processing has
to occur before deciding which segment (retransmit / new) to send next. Holes
will then be marked as fully retransmitted before the 2nd retransmission round
advances to them.
10000               14000   15000   18000
14000               18000
(normal transmit resumes, but with cwnd shrunk by 2 congestion events)
18000               19000
19000               20000
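For illustration only, the detection rule can be sketched in Python and run against the ACK stream from the tables above. This is a minimal sketch under stated assumptions, not code from any of the investigated stacks: the names hole_octets and LostRetransmitDetector are made up, SACK blocks are given as (left, right) pairs with an exclusive right edge, and the 1 RTT guard and any reordering handling are left out. It counts ACKs for which the cumulative ACK stays constant while the octets in holes shrink.

```python
DUP_THRESH = 3  # assumed: same threshold as for the initial fast retransmit


def hole_octets(cum_ack, sack_blocks):
    """Octets above cum_ack and below the highest SACKed octet that are
    not covered by any SACK block."""
    missing, edge = 0, cum_ack
    for left, right in sorted(sack_blocks):
        if left > edge:
            missing += left - edge
        edge = max(edge, right)
    return missing


class LostRetransmitDetector:
    """Hypothetical sender-side check for a lost retransmission of SND.UNA."""

    def __init__(self):
        self.last_ack = None
        self.last_holes = None
        self.count = 0

    def on_ack(self, cum_ack, sack_blocks):
        """Return True once DUP_THRESH consecutive ACKs fill holes while
        the left edge of the leftmost hole does not move."""
        holes = hole_octets(cum_ack, sack_blocks)
        if cum_ack == self.last_ack and holes < self.last_holes:
            self.count += 1
        else:
            self.count = 0
        self.last_ack, self.last_holes = cum_ack, holes
        if self.count >= DUP_THRESH:
            self.count = 0  # stay armed for a further loss
            return True
        return False


# ACK stream from the example: (cumulative ACK, SACK blocks)
acks = [
    (9000, []),
    (10000, [(15000, 16000)]),
    (10000, [(15000, 17000)]),
    (10000, [(15000, 18000)]),
    (10000, [(11000, 12000), (15000, 18000)]),
    (10000, [(11000, 13000), (15000, 18000)]),
    (10000, [(11000, 14000), (15000, 18000)]),
    (14000, [(15000, 18000)]),
    (18000, []),
]
detector = LostRetransmitDetector()
fired = [detector.on_ack(ack, blocks) for ack, blocks in acks]
print(fired.index(True))  # 6, i.e. the ACK triggered by retransmitted 13000
```

The first three duplicate ACKs do not trip the check, because they only extend the right edge of the SACK block and leave the hole size unchanged; only the ACKs caused by retransmissions filling holes are counted.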
And yes, I was unclear with the use of the terminology; I should have probably
stated "pipe" instead of "cwnd" below, as cwnd is not touched during LimitedTransmit /
FastRetransmit...
Richard Scheffenegger
-----Original Message-----
From: Alexander Zimmermann [mailto:zimmermann at nets.rwth-aachen.de]
Sent: Montag, 9. November 2009 16:26
To: Scheffenegger, Richard
Cc: tcpm at ietf.org Extensions WG
Subject: Detect Lost Retransmit with SACK
Hi Richard,
first of all, welcome to the list :-)
Since your question is not really related to the poll, I changed the title...
Comments inline.
Am 09.11.2009 um 13:57 schrieb Scheffenegger, Richard:
>
>
> Hi Alexander et al.,
>
> This will be the first post to this group, so excuse me if I act
> inappropriately.
>
> I'm curious about one little tidbit which has been bugging me for the
> better part of the last two months, and which is closely related with
> TCP SACK operations (thus it might belong to this thread?)
>
>
> The implicit assumption for TCP fast recovery is, that packet loss
> happens randomly (ie. to different segments each time) with low
> correlation between the drop events. Also, a drop event is used as an
> implicit signal to indicate congestion. So far, so good.
>
> It seems to me, that the focus of most developments has been the
> internet environment - where statistical assumptions like the above
> mentioned arguably hold true.
>
> However, certain high-speed LANs seem to exhibit characteristics,
> which don't play well with these implicit assumptions (uncorrelated
> packet loss) - the smaller the network, the more deviation from a
> "good seasoned" link (exhibiting some form of congestion) is likely
> to occur.
>
> Also, as has been noted in prior research, many internet routers do
> use more "tcp-friendly" RED or WRED queue policies, over the
> simplistic TailDrop most often encountered in LANs (default policy
> of L2 switches and L3 routers).
>
> In one extreme, I have found a (misbehaving?) TCP stack/host, which
> sends out a burst of segments (4-6) @ 10GbE wirespeed, which
> immediately causes queue buffer overload and TailDrop in the first
> hop L2 switch, when two such high performance hosts try to establish
> a high speed communication. In other words, the hosts themselves
> seem to make sure that there is a high correlation between TCP
> (fast) recovery and further packet loss.
>
>
> But what puzzles me the most - even with SACK enabled TCP stacks,
> virtually no implementation can detect / act upon detection of the
> loss of a retransmitted segment during fast recovery. This despite
> the fact that the stipulations in RFC3517 require the receiver to
> make the information to detect such an event implicitly available to
> the sender. The first SACK option has to reflect the last segment,
> which triggered this SACK.
>
> Together with the scoreboard held at the sender, it should be rather
> easy to find out if the left edge of the lowest hole (relative to
> stream octets) closes.
What do you mean by "left edge of the lowest hole"? Do you mean SND.UNA?
If an ACK covers SND.UNA then it is a cumulative ACK.
>
> If that left edge stays constant for "DupThresh" number of ACKs,
> which reduce the overall number of octets in holes (any one hole
> might close due to the retransmitted packets still received), AND
> the sender retransmits beginning with the lowest hole first, this
> would be a clear indication of another segment retransmit loss...
Sorry, I don't understand. If we have 20 segments in flight and one
segment gets lost, you will retransmit after 3 DUPACKS the oldest
outstanding segment.
Then, assuming no reordering and no further loss, you will get 17
DUPACKS (without Limited Transmit) before your hole is closed.
What am I missing here?
Can you give me an example?
>
> Even a less speedy detection logic would work for SACK-enabled
> sessions: once the fast recovery is finished from the sender's point
> of view, if the receiver still complains about missing segments
> (indicated by having the SACK rightmost edge - in the first slot
> SACK option - at a segment higher than when fast recovery started),
> another round of fast recovery could be invoked, rather than waiting
> for RTO.
>
> Of course, the first approach would be better for low cwnd sessions
> with only very few segments in transit - and both could be combined
> with the proposed sack recovery speed-ups... (Reducing DupThresh for
> low cwnd sessions / when little data is being sent).
>
>
>
> Congestion control should act to this event (it will now, but only
> one RTO later...), and the SACK retransmit vector (HighRxt) reset,
> using LimitedTransmit for sending out the retransmission segments -
> once cwnd + pipe allows; any retransmitted segments still in the
> network will close their respective SACK holes before the new
> HighRxt advances to them.
>
> And, RTO should be reduced (I guess to nearly zero, between SACK-
> enabled hosts).
>
>
> I have run numerous tests, to check the behavior of different TCP
> Stacks (FreeBSD 4.2 - 8.0; windows xp, vista, 7, 2003; Linux 2.6.16
> and others).
>
>
> All these stacks seem to exhibit this issue; What I don't know yet
> is the percentage of multi-loss segment events triggering RTO - but
> I assume that the majority of RTOs happen because of this.
>
> In LAN environments (i.e. 10 GbE over 1 km @ 2 ms latency due to the
> L2 hops in between) featuring relatively few streams, the effect of
> any single RTO can be quite tremendous - taking considerable
> theoretical bandwidth away from the session (i.e. 1 sec minimum RTO
> equals 1.2 GB; even with more recent RTO values around 0.2 - 0.4
> sec, each RTO is still a few hundred MB of "lost" capacity under
> optimal circumstances).
>
>
> Nevertheless, I can't imagine that I am the first one to bring up
> this issue (despite having failed to find any study of this
> effect). :)
>
>
> One more clarification, which came up after I looked at the FreeBSD
> implementation of Limited Transmit; this might be a nit-pick, but
> when RFC 3042 is active, shouldn't ABC also be used during
> LimitedTransmit / FastRecovery?
Why? One reason for ABC are lying receivers (ACK Division). So, the
worst case is Slow-Start...
> (FreeBSD MAIN is increasing cwnd by 1 mss for each new ACK, instead
> of by the amount of data in that ACK...)
What do you describe here? Slow-Start?
RFC 3042 says: "The congestion window (cwnd) MUST NOT be changed when
these new segments are transmitted."
>
> Thanks a lot!
>
>
> Best regards,
Alex
>
>
>
> Richard Scheffenegger
>
> * To: "tcpm at ietf.org WG Extensions" <tcpm at ietf.org>
> * Subject: [tcpm] Should draft-ietf-tcpm-sack-recovery-entry update
> RFC 3717 (SACK-TCP)
> * From: Alexander Zimmermann <alexander.zimmermann at nets.rwth-aachen.de>
> * Date: Wed, 21 Oct 2009 12:22:50 +0200
>
> _____
>
> Hi folks,
>
> based on the fact that the draft "draft-ietf-tcpm-sack-recovery-
> entry" is adopted as WG item now and intended to be a "standards
> track" document, I would like to start a poll/discussion whether the
> draft should update RFC 3517 or not? Moreover, should we produce a
> separate document or an update of RFC 3517?
>
> a) separate document, do not update RFC 3517
> b) separate document, update RFC 3517
> c) RFC3517bis, obsolete RFC 3517
>
> //
> // Dipl.-Inform. Alexander Zimmermann
> // Department of Computer Science, Informatik 4
> // RWTH Aachen University
> // Ahornstr. 55, 52056 Aachen, Germany
> // phone: (49-241) 80-21422, fax: (49-241) 80-22220
> // email: zimmermann at cs.rwth-aachen.de
> // web: http://www.umic-mesh.net
> //
>
>
> _______________________________________________
> tcpm mailing list
> tcpm at ietf.org
> https://www.ietf.org/mailman/listinfo/tcpm