Re: [tcpm] Detect Lost Retransmit mit SACK

"Scheffenegger, Richard" <rs@netapp.com> Sun, 15 November 2009 02:16 UTC

Return-Path: <rs@netapp.com>
X-Original-To: tcpm@core3.amsl.com
Delivered-To: tcpm@core3.amsl.com
Received: from localhost (localhost [127.0.0.1]) by core3.amsl.com (Postfix) with ESMTP id 85B963A67D1 for <tcpm@core3.amsl.com>; Sat, 14 Nov 2009 18:16:42 -0800 (PST)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -4.11
X-Spam-Level:
X-Spam-Status: No, score=-4.11 tagged_above=-999 required=5 tests=[AWL=-1.400, BAYES_05=-1.11, J_CHICKENPOX_26=0.6, J_CHICKENPOX_33=0.6, J_CHICKENPOX_34=0.6, J_CHICKENPOX_83=0.6, RCVD_IN_DNSWL_MED=-4]
Received: from mail.ietf.org ([64.170.98.32]) by localhost (core3.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id TfPdfDYgZEya for <tcpm@core3.amsl.com>; Sat, 14 Nov 2009 18:16:40 -0800 (PST)
Received: from mx3.netapp.com (mx3.netapp.com [217.70.210.9]) by core3.amsl.com (Postfix) with ESMTP id CF81C3A6784 for <tcpm@ietf.org>; Sat, 14 Nov 2009 18:16:39 -0800 (PST)
X-IronPort-AV: E=Sophos;i="4.44,744,1249282800"; d="scan'208";a="104084379"
Received: from smtp3.europe.netapp.com ([10.64.2.67]) by mx3-out.netapp.com with ESMTP; 14 Nov 2009 18:17:09 -0800
Received: from amsrsexc1-prd.hq.netapp.com (webmail.europe.netapp.com [10.64.251.107]) by smtp3.europe.netapp.com (8.13.1/8.13.1/NTAP-1.6) with ESMTP id nAF2H93x015782; Sat, 14 Nov 2009 18:17:09 -0800 (PST)
Received: from LDCMVEXC1-PRD.hq.netapp.com ([10.65.251.108]) by amsrsexc1-prd.hq.netapp.com with Microsoft SMTPSVC(6.0.3790.3959); Sun, 15 Nov 2009 03:17:09 +0100
X-MimeOLE: Produced By Microsoft Exchange V6.5
Content-class: urn:content-classes:message
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
Date: Sun, 15 Nov 2009 02:15:19 -0000
Message-ID: <5FDC413D5FA246468C200652D63E627A066BDBEE@LDCMVEXC1-PRD.hq.netapp.com>
In-Reply-To: <4AFA79BD.4090200@i4.informatik.rwth-aachen.de>
X-MS-Has-Attach:
X-MS-TNEF-Correlator:
Thread-Topic: [tcpm] Detect Lost Retransmit mit SACK
Thread-Index: AcpirAr3zK/uxzYbRgiHMTMGRHK5agC5WJiw
From: "Scheffenegger, Richard" <rs@netapp.com>
To: Arnd Hannemann <hannemann@i4.informatik.rwth-aachen.de>, tcpm@ietf.org
X-OriginalArrivalTime: 15 Nov 2009 02:17:09.0215 (UTC) FILETIME=[BC024EF0:01CA6599]
Subject: Re: [tcpm] Detect Lost Retransmit mit SACK
X-BeenThere: tcpm@ietf.org
X-Mailman-Version: 2.1.9
Precedence: list
List-Id: TCP Maintenance and Minor Extensions Working Group <tcpm.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/listinfo/tcpm>, <mailto:tcpm-request@ietf.org?subject=unsubscribe>
List-Archive: <http://www.ietf.org/mail-archive/web/tcpm>
List-Post: <mailto:tcpm@ietf.org>
List-Help: <mailto:tcpm-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/tcpm>, <mailto:tcpm-request@ietf.org?subject=subscribe>
X-List-Received-Date: Sun, 15 Nov 2009 02:16:42 -0000

Hi Arnd,

Sorry for the late response. I did some further investigations 
lately, and this is what I came up with.

Of course, the majority of work and thoughs have been done by 
others (most notably, I want to cite [[1]] "Lost Retransmission 
Detection for TCP Part 2: TCP Using SACK Option" by 
B. Kim et al (2004), but also by some ideas from [[2]] M. Mathis 
(dated) RateHalving NetBSD based code found here 
http://www.psc.edu/networking/projects/rate-halving/ 
(which did something similar but only for a single segment  
for the SACK recovery generation, if I can tell...)

First, the following prerequisites are required:

1) TCP Session must be SACK-enabled
2) TCP Sender has to have FACK implemented 
3) TCP Sender has to have a scoreboard implementation to 
   track the state of the receiver, working with a sorted 
   list of SACK-Holes rather than positive SACK-Ranges.


The fix might not be mathematically perfect in tracking each single 
segment, but in it's simplest incarnation, it's about 8 lines of 
code, adding minimal constant time (!) during ACK-processing. 

The idea is basically, not to track every sent out segment 
individually, but rather all holes in the scoreboard at once.

The implicit assumption here is, that retransmissions get sent out
in order (and are received with little to no reordering, or not 
at all).

When the sender transmits the last segment to "close a hole" in the 
sender's representation of the scoreboard, represented by a data 
structure holding start, end and rxmit vector, ie. When rxmit == end, 
rxmit will be assinged the current value of snd.max. (In the simple 
form, a separate variable iskept, but for the purpose of this 
algorithm, the condition of rxmit <= end vs rxmit > end can be used 
to differentiate between the two uses (current retransmission vector 
in a hole, vs. segment number by which the hole should be reported 
closed by the receiver) of rxmit.

So, after the retransmission of the last segment of a hole, it's 
rxmit will show the expected snd.fack value when that hole is ought 
to have been properly ACKed by the receiver.

In SACK processing, some precautions have to be made during hole-
splitting (to keep the higher than sackhole.end rxmit vector). 

Normal SACK processing will remove any fully closed holes before 
snd.fack becomes greater than sendhole.rxmit, except when one of 
the segments within that hole is never seen by the receiver. While 
processing a normal SACK, check the condition of 
snd.fack > sackhole[0].rxmit; 

In the worst case (1 segment hole), when some reordering occurred, 
this will yield one superfluous segment; with larger holes, the 
intrinsic reordering grace period granted is n-m segments (with n 
the size of the hole, and m the segment within the hole).

If that condition is met, reset sackhole[0].rxmit to 
sackhole[0].start, and also update the shortcut lookup pointer 
(sackhint.nexthole to sackhole[0]).


Extended periods of heavy loss are not dealt with. According to 
[[1]], these should be counted as additional congestion events, 
reducing ssthresh and cwnd accordingly; but this method needs a 
constant stream of at least some new segments to leave the 
Network).
Thus, RTO + slow start is required for recovery of such severe
Events.

Also, when there is no more data from the socket to transmit, 
to never exceed snd.max recorded for that hole, it will not 
work. (Again, RTO to the rescue).

But, it will update the rxmit vector for each fully retransmitted 
hole - as long as it keeps going. This yields a maximum number or 
RTO/RTT retries, before "usual" recovery sets in, and should 
cover the common case.


You could say it's more of a quick hack, but I think the 
simplicity, intrinsic properties and near no overhead (processing 
time or variable overhead) are well balanced, but if my assumptions
are incorrect or short sighted, please say so.


The simple version without any tweaks in existing checks of 
sackhole[]->rxmit (requiring one more tcp_seq var in sackhole[]):

Tcp_sack_doack:

(after sanity checks about the new SACK blocks, BEFORE new snd.fack 
has been calculated (one additional ACK grace period):
 
	if (((temp = TAILQ_FIRST(&tp->snd_holes)) != NULL) && //###
	     (SEQ_GT(tp->snd_fack, temp->limit))) {           //###
			temp->rxmit = temp->start;                //###
			temp->limit = tp->snd_max;                //###
			tp->sackhint.nexthole = temp;             //###
		}
//###

Tcp_sack_doack:

When splitting a hole in two:

		temp = tcp_sackhole_insert(tp, sblkp->end,
				    cur->end, cur);
				if (temp != NULL) {
					if (SEQ_GT(cur->rxmit,
temp->rxmit)) {
						temp->rxmit =
cur->rxmit;
	
tp->sackhint.sack_bytes_rexmit
						    += (temp->rxmit
						    - temp->start);
					}
					temp->limit = cur->limit;
//###
					cur->end = sblkp->start;
					// RS: SACK+ mod?
					cur->rxmit = SEQ_MIN(cur->rxmit,
					    cur->end);
				}


Tcp_output:

(when in Retransmit && SACK mode):

If (sack_rxmit == 1) {
		th->th_seq = htonl(p->rxmit);
		p->rxmit += len;
		if (SEQ_GEQ(p->rxmit, p->end))
//###
			p->limit = tp->snd_max;
//###
		tp->sackhint.sack_bytes_rexmit += len;
}

Tcp_var.h:
struct sackhole {
	tcp_seq start;		/* start seq no. of hole */
	tcp_seq end;		/* end seq no. */
	tcp_seq rxmit;		/* next seq. no in hole to be 
                                 retransmitted */
	tcp_seq limit;          /* ### for lost retransmission 
                                 detection - should be rolled 
                                 into rxmit */
	TAILQ_ENTRY(sackhole) scblink;	/* scoreboard linkage */
};


Richard Scheffenegger
Field Escalation Engineer
NetApp Global Support 
NetApp
+43 1 3676811 3146 Office (2143 3146 - internal)
+43 676 654 3146 Mobile
www.netapp.com 
Franz-Klein-Gasse 5
1190 Wien 

 

> -----Original Message-----
> From: Arnd Hannemann [mailto:hannemann@i4.informatik.rwth-aachen.de] 
> Sent: Mittwoch, 11. November 2009 09:46
> To: tcpm@ietf.org
> Subject: Re: [tcpm] Detect Lost Retransmit mit SACK
> 
> Hi Richard,
> 
> Scheffenegger, Richard schrieb:
> >>>   A1) With TSOpt, they could be used to "remember" when 
> >>>       Retransmission started;
> >>>       When an ACKs with TSEcn > (TSOpt + RTT) is seen by 
> >>>       the sender, it can
> >>>       re-arm the DUPACK detector.
> >>>       
> >> This is your non-SACK scenario?
> >> Please note that a TCP receiver will NOT echo TSVal from 
> out-of-order 
> >> segments. So this won't work.
> >>     
> >
> > Interesting; this doesn't seem to be what certain (most?) 
> stacks are 
> > doing - they seem to always reply in the ACK Tsecr the value of 
> > Tsopt/TS.recent triggering the ACK...
> 
> Please clarify, is this non-SACK scenario?
> The only lost retransmission we could be talking about then, 
> is the retransmission of SND.UNA.
> And the TCP receiver will NOT adjust TS.recent for any out-of 
> order packets (RFC 1323).
> It will continue to echo the timestamp of the last segment 
> which advanced the window (SND.UNA -1p).
> 
> 
> > Or was your intention to say, that the receiver will not 
> respond with 
> > Tsopt, but rather TS.recent?
> 
> Yes, it will not respond with TSVal but with TS.recent.
> 
> > If so, than this makes no difference - the sender will detect the 
> > first ACK from the receiver, which has the same TS (well, actually 
> > TS+1 as multiple segments are likely to carry the same TS) when the 
> > retransmission
> 
> There won't be an ACK if the retransmission is lost in 
> non-SACK case, only dupACKs.
> (Only retransmission is retransmission of SND.UNA) TS.Recent 
> will not get updated.
> 
> 
> > was started. This will be the sign for the sender, that at 
> least one 
> > RTT has elapsed (that was the intention for having 
> timestamps in the 
> > first place, right?). If the returned ACK does not cover the first 
> > retransmitted segment by that time, it's quite reasonable to assume 
> > that it got lost again
> >   
> 
> > Again, keep in mind that I'm looking on LANs; a TS in 10 ms 
> increments
> 
> If you are searching for something which is not intended for 
> general use, why not just set MIN_RTO to say 10 ms?
> 
> >>>   A2) With SACK, tracking SND.NXT and HOLEBYTES (number of 
> >>>       octets in all current holes)
> >>>       can be employed to track RTT (they need to be 
> >>>       initialized when the first
> >>>       retransmitted segment is sent). When an ACK contains 
> >>>       a SACK for >= SND.NXT, or the
> >>>       HOLEBYTES are smaller than when retransmission started, the 
> >>>       DUPACK detector can be re-armed.
> >>>       
> >> You should specify what you mean with SND.NXT.
> >> We are in recovery, so we may very well send out new segments.
> >> I don't see why number of holes in sack scoreboard has 
> anything todo 
> >> with RTT.
> >>     
> >
> >
> > SND.NXT is the rightmost segment which the sender has never before 
> > sent out to the network. Isn't that the meaning of Snd.nxt 
> according 
> > to RFC793?
> 
> Hmm, I would interpret RFC793 differently.
> I would call it SND.MAX (see RFC4015) or HighData (RFC3517) 
> the rightmost segment which the sender has never sent before. 
> But now you have to clarify how it would be possible to 
> receive a SACK block > SND.MAX. Even SACK block == SND.MAX 
> will be rarely seen.
> But before taking this discussion even further, maybe take 
> some time and write some preliminary draft of your algorithm.
> 
> > And I was not talking about the # of holes, but rather the 
> amount of 
> > Data (in sum) between FACK (defined by the rightmost offset in the 
> > SACK Scoreboard, i.e. covering the highest ever received 
> segment) and 
> > SND.UNA.
> >
> > Monitoring Holes will gain nothing (when there is reordering for 
> > example);
> >
> > But if the any one retransmission makes it though, the amount of 
> > unSACKed data in the sum total of all holes in the scoreboard, will 
> > decrease - indicating when at least one RTT has passed 
> since start of 
> > FastRetransmit.
> 
> This is not true.
> Consider a 100 packet in flight, you get 3 dupacks, do  a 
> retransmission.
> This is in your words "start of FastRetransmit". After this 
> the sum total of all holes between SND.UNA and FACK can only 
> increase for one RTT.
> Due to Limited Transmit and Fast Recovery, it might be 
> possible that even the sum total of all holes between SND.UNA 
> and SND.MAX will increase.
> You know when the first Retransmit is lost when a SACK covers 
> RecoveryPoint (RFC3517). But this is only true for first hole.
> To detect lost retransmissions of the other holes, you would 
> need to store a separate HighData (RFC3617) for every retransmission.
> Maybe this is after all what you are trying to express?
> Perhaps it would be helpful if you could cook up some pseudo 
> code, of your proposed algorithm.
> 
> >>     
> >>> holes
> >>>    have been retransmitted once... 
> >>>       
> >> Show me an example. where a lost retransmission is 
> detectable, while 
> >> the sender did not send out retransmissions for every other hole?
> >> Sounds weird.
> >>     
> >
> > Can I attach trace files on posts to this list, or shall I 
> sent them 
> > to you directly?
> 
> Never mind, does not actually matter.
> 
> >>>       
> >> This probably highly rates from case to case. If you have a huge 
> >> bandwidth-delay product even SACK scoreboard handling will have 
> >> computational complexity which might outweigh its benefits.
> >> As you have seem to have the equipment at hand, you could 
> do some cpu 
> >> load and throughput measurements with 10GbE and SACK 
> versus non-SACK 
> >> flows.
> >>     
> >
> > Well, I was doing some test like this:
> >
> > 
> http://www.ibm.com/developerworks/linux/library/l-tcp-sack/index.html
> >
> > But the CPU impact (on top of what is required for 10G) was 
> neglegible.
> > But I have to admit, I didn't had the time yet to test a SACK 
> > scoreboard with every other byte marked mission (which 
> could mean some 
> > 0,5 mio entries).
> >
> > But it seems that most implementations feature sorted 
> scoreboards of 
> > limited
> > (16/128) size.
> >
> > Traversing those is - again in my case of 10G LANs - still 
> many orders 
> > of magnitude faster then waiting for RTO.
> >
> > I personally consider RTOs to be harmful (when there are 
> other means 
> > to recover with). :) (perhaps once I have worked through all the 
> > issues raised here, I submit a draft with "TCP RTO 
> considered harmful" 
> > next April? :)
> 
> Better submit a draft to mitigate the problems with it ;-)
> 
> Best regards,
> Arnd
> 
> 
> 
> 
> 
> _______________________________________________
> tcpm mailing list
> tcpm@ietf.org
> https://www.ietf.org/mailman/listinfo/tcpm
>