TCP Maintenance and Minor Extensions (TCPM) Working Group
November 11, 2010 - IETF 79 - Beijing, China
=========================================================

Thanks to note-takers Ari Nyrhinen and Yi Ding!

WG Status - presented by the chairs (David Borman)
---------

Two WG documents are near publication (LCD and urgent data).

All information on the other documents was sent to the mailing list and is
in the slides; we won't spend much time on them in the meeting.

TCP Security
- In Anaheim, there was a separate session on this (6-8 people) that looked
  at the recommendations one by one and put together rules for consistent
  reviewing.
- Only 3 people came to the meeting on this in Beijing.
- Fernando will put together a revision structured like RFC 1122, so the
  recommendations will be clear and separate from the discussion (the goal
  is to have it by mid-December).
- The best way forward is to add all the pieces to the WG document.

Lars: When the long original document was brought to the working group, it
was clear that the WG did not want to publish it as-is. Yes, some local
participants might have made this disapproval sound larger than it was, but
I did not feel that there was even a small minority that wanted to accept it
as-is. Fernando is willing to keep editing the document, but I have some
doubts that we have the energy in the WG to finish this, and it's going to
lead to frustration for the author and the WG. I really wonder if you can
finish this document. (Fernando is not here - nobody sees him in the room.)

David: Fernando indicated he's willing to continue.

Lars Eggert: We've had this document for a year and there hasn't been a lot
of forward progress on it; I think you chairs need to watch it and decide
whether the WG has the energy to finish it.

Chairs: Either this works, or this is just not going to happen in this WG.

Mahesh Jethanandani: I'm glad you guys are doing this; I'll agree to review
the document.

Chairs: We're looking into having individuals review sections of the
document instead of the whole document; it's a huge document and a full
review would be too big a task.

Wesley: A reminder of what Matt said: if we don't do this document, someone
else will, and if we don't have expert reviewers carefully check each
recommendation, it may leave us unable to do good things later.

(Flipped through the rest of the WG items and other drafts, but didn't
discuss them in order to keep the meeting moving.)

A Testbed Study on IW10 vs IW3 - presented by Jerry Chu
------------------------------

This is the 3rd time supporting data has been presented at IETF meetings;
this time the data comes from a controlled testbed. A summer intern
(Yaogong Wang) produced the data; he is now back at school (North Carolina
State University), but continues to collaborate and has a web page up with
the results.

Fred Baker asked how IW10 performs on a very slow link (e.g. 56-64 kbps).
It's hard to get microscopic insight into flow behavior from the datacenter
measurements; the testbed tests can do better.

Poisson arrivals are used to emulate a browser (a rough sketch of such a
load generator appears after this setup summary). Linux's SACK
implementation is more advanced than most; results are also available with
SACK disabled. A number of bugs affect the results to some extent; many
knobs have to be tuned.

Lars: Have you only tested with Linux?
Jerry: Yes.

Murari Sridharan: We were hoping to have some results this time around; the
problem is that in Windows 7 the IW is not configurable, so we'd have to
give you a special version; we can give you a modified version.

(continuing with description of parameters)
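[Note: a minimal sketch of a Poisson-arrival load generator of the kind
mentioned above, emulating browser requests. The arrival rate, object
sizes, and server address are illustrative assumptions, not the testbed's
actual parameters.]

    # Sketch: Poisson-arrival request generator emulating a browser.
    # All parameters below are illustrative, not the testbed's values.
    import random
    import socket
    import time

    REQUESTS_PER_SEC = 2.0          # assumed mean arrival rate (lambda)
    OBJECT_SIZES_KB = [8, 16, 32, 64]  # assumed response sizes
    SERVER = ("192.0.2.1", 8080)    # placeholder address

    def run(duration_s=60):
        deadline = time.time() + duration_s
        while time.time() < deadline:
            # Poisson process: exponentially distributed inter-arrival gaps.
            time.sleep(random.expovariate(REQUESTS_PER_SEC))
            size_kb = random.choice(OBJECT_SIZES_KB)
            with socket.create_connection(SERVER) as s:
                s.sendall(("GET /%dk HTTP/1.0\r\n\r\n" % size_kb).encode())
                while s.recv(65536):   # read until the server closes
                    pass

    if __name__ == "__main__":
        run()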
Lars: The queue length used seems to be really huge.
Jerry: That number was given by Fred Baker.
Lars: What would happen if you set it to something like 10 or 5?
Jerry: If it is too small, performance is terrible.
Lars: Right, but here the delay is terrible.

(continuing with description of test cases)

Google initially wanted to optimize completion time; people had concerns
about background flows.

Lars: When you showed this the first time in Anaheim, the completion time
metric started when the GET was received. Has this changed now? I'm worried
that we lose the SYN-ACK exchange from the timing.

Scott Bradner: The last time the IW was increased, there was a lot of
research done to see how it affected dialup modems; some percentage of users
are still behind modems, so it's not a zero case. Have you figured out what
their buffering is?
Jerry: No, but we have results that show the completion time is not much
worse.

Lars: If you are using netperf, then it includes the SYN-ACK exchange, so
now it is there.

Jerry: No doubt IW=10 increases the packet loss rate; with extreme load it's
a lot higher, but regardless of the packet loss rate, IW=10 seems to improve
things, or at least not hurt. No serious fairness problem was detected
between IW=3 and IW=10 flows.

Ilpo Järvinen (back to slide 10): I wonder about the packet loss numbers;
have you figured out how much of the loss happened to the flow itself and
how much affected the other flows?
Jerry: Good question; no, we don't have that correlation.

Murari: Regarding SACK, back to Lars's point; I'm worried about the Linux
SACK. When SACK is enabled, we need to worry about the recovery of lost
retransmissions.
Jerry: We also have tests with SACK disabled.

(slide 11)

Idris Rai: So you observed an increase in loss rate, and at the same time
the queuing delay is very long. I think you should have a qlen of 20 packets
instead of 40 packets; if that were the case, would the loss rate be higher?
Would the completion time change? My main question is that your queuing
delay is very long; if you make it shorter, you'll observe more losses,
which would increase the completion time due to more timeouts. It's
difficult to draw a conclusion.
Jerry: I'm not saying IW10 always improves response time.

Ilpo: I can confirm that on these very low-bandwidth links, IW3 is already
so aggressive that it causes a lot of problems for itself. The additional
IW10 problems are hidden by the already serious problems of IW3; that's why
the numbers don't always seem to be as bad as we'd expect.

Andrew: Seeing these numbers reminds me that the traffic shaping code is
very sensitive to how the kernel is timeslicing. Can you tell me if you are
using a tickless kernel with a good high-resolution time source, or a
ticking kernel?
Jerry: We have two different testbeds; I don't remember if tickless and
high-resolution timers are enabled.
Andrew: I'm concerned about the really slow links; ticking kernels really
bite you because of the resolution.
Murari: Are you saying that by default the Linux kernel paces packets?
Jerry: The tickless timer resolution... (interrupted)
Murari: Given that the initial burst is not timer driven, is Linux pacing?
Andrew: I'm talking about the simulation of the 64 kbit/s link; the router
is my concern, not the clients.
Jerry: There is a known problem with the older kernel.
Andrew: It's more about the complete config and whether it works with
tickless; that's my main concern. It's more complicated than the kernel
version.

(Jerry and Andrew decide to talk offline to keep the meeting moving.)
(slides 12-13) IW=10 has a long tail; the figures show queue length
increases.

(slide 14) At high load, the IW10 advantage goes away because the higher
queuing delay eats the advantage.

Lars: I'm really disappointed that it is so hard to look at the data. What
are the colored bars on the left, for example?
Jerry: The numbers are a client-server pair.
Lars: Is it the average you are showing on the bars? What are the error bars
on those averages? What are the four bars?
Jerry: The left two are IW3, the right two are IW10; left is IW3, middle is
IW10, and the right is a mixture. Go to the website and you can see all the
combinations; I'm just listing the observations.

(slide 16)

Andrew: IW10 has 10 opportunities to lose a packet, IW3 has only 3;
therefore the median number of losses for IW10 is higher.
Jerry: I need to talk to you!

(slides 17 and 18) Background flows do better under IW10 than IW3. If you
have explanations for why this happens, that would be great. My explanation
is that IW10 is more efficient, so there are more gaps for other flows.

Lars: You said a 64 kbit/s bottleneck? Do you see the same for the faster
bottleneck?
Jerry: Check the website; I think it is much the same, but it does not show
any damage. Under very high load, the background flow sometimes stops
forever (timeout). This happens more under IW=10. SACK does help reduce UCT,
but only by a small percentage, for both IW=10 and IW=3.

Yoshifumi Nishida: Is it possible to get the raw data for the simulation? I
only see processed data.
Jerry: The tcpdumps, I'm not sure; I can ask my intern. If you volunteer to
process raw data, I'd be delighted. I'll try to make the tcpdumps public.

Effect of IW and Initial RTO Changes - presented by Ilpo Järvinen
------------------------------------

With a huge amount of concurrency, the benefit of IW=10 disappears. The
bottom line: both IW=3 and IW=10 are in trouble.

(slide 5) In the worst cases, IRTO=1 is able to help a bit, but it has the
opposite effect for the first starting burst.

(slides 6-8) These numbers really are real; the initial RTO of 1 adds
fairness because it changes the flows' positions slightly due to an
early-occurring RTO.

(slide 9) Two configs: RED based on the recommended parameters from the web
pages, and "REDok", aimed at more variable load and made aggressive enough
to affect slow start. REDok is very different from the defaults. The intent
with REDok was to keep operating within the RED range and not fall back to
FIFO behavior.

(slide 10) On the queue lengths: the steep, large-buffer lines peak at about
4 seconds in a low-load case. It seems that at the very high end IW=10 is
more aggressive, but for the rest it is less aggressive. The theory is that
IW=10 causes a lot of congestion for itself, so it needs to recover more;
IW=3 keeps flows going more constantly.

(slide 11) The yellow-vs-cyan lines show that at the higher end, REDok
cannot control the load IW=10 causes.

(slide 12) Red vs. green: IW=10 self-congestion makes it more bursty and
less aggressive.

(slide 13) The area between the cyan and yellow lines expresses the
unresponsiveness of IW=10. Packets arrive from "nowhere", so RED can do
nothing to control them until they are already in the queue.

(slides 14-15) Self-congested cases for IW=10 make it less aggressive than
IW=3.

(slide 16... questions?)

Lars (to both Jerry and Ilpo): I must admit I don't see a clear story yet, a
case for the conditions under which we can take this forward. We can see
cases where there is no downside, but there are also cases where it causes
problems or at least does not help.
There needs to be a story on where it helps.
Jerry: The difficulty here is that it is a tradeoff. Our numbers show that
with faster links the improvement is dominant; we hardly find any downside.
In extreme cases there are problems, and some people seem to think those
cases matter.
Ilpo: To me it seems there are clear extremes where we know what happens,
but then there is a middle ground where it is very hard to figure out a
clear boundary between where it helps and where it does not.
Jerry: There is real damage in terms of increased retransmissions; in TMRG,
some people think an increase of 1 or 2% is catastrophic. I don't think we
have a consensus on this.
Lars: One problem is that it is really hard to say even "OK, we can do this
over broadband", because the sender does not know whether the client is on
a broadband link. You have done a good job at gathering data. For the WG, we
need to decide where to put the bar; having more and more data is not really
helping us decide any better. That is what we should be talking about: at
what point are we willing to take this forward, and under which scenarios?
Jerry: The advertised receive window could be used for that problem.
Murari: But that's (at least in Windows) based on the interface speed.
Andrew: People with Linux routers are going to be changing receive windows
on the fly too (for highly congested slow links), like our friend from
Uganda.

Data Center TCP (DCTCP) - presented by Murari Sridharan
-----------------------

(slides 3-8)

Jerry: Jittering seemed to solve the problem?
Murari: All apps would need to change, but yes, jitter is a solution, though
a very app-specific one. The challenge is to get low latency for short flows
and high throughput for long ones. Buffers significantly increase the cost
and directly impact latency; shallow buffers can't absorb bursts.

(slide 14) RED is pretty much useless for these scenarios.
Ilpo: I would not put it like that. With the default parameters RED
certainly is useless, but maybe it could be configured to be more
aggressive.
Andrew: A shallow-buffered switch suffering from incast is actually running
out of buffers to do the enqueue.

(slide 18) Configure RED so that the max and min thresholds are the same, so
the minute you go beyond that, it marks. TCP also needs to cut proportional
to the losses.
Ilpo: I think you should also configure the RED w_q to be one.
Murari: Yes. Instead of always cutting the window by half, you cut it by
half of the marking probability. (A rough sketch of this reaction appears at
the end of these notes.)

(slides 19-21)

Jerry: I still don't quite understand how this solves the incast problem;
too many incast flows can still overflow the buffer.
Murari: That is true; the queue size is kept small and latency improves for
small flows.
Andrew: It works because there is now free buffer space.

(slide 27)

Murari: Commodity switches do ECN marking; the layer violation is already
done.

(end of slides)

Jerry: Have you ever tried delay-based algorithms?
Murari: At 10 Gbps there's so much noise in the delay measurement from other
sources (he gave an intermodulation example); the RTT and the processing are
on the same timescale.
Jerry: Do you have plans to move this forward?
Murari: The only confidence we have is that this works in a datacenter
environment, but we feel there is something here we should consider for the
Internet.
Andrew: Something like this could work on much slower timescales; I've seen
relatively slow wireless links that suffer from incast.
Lars: On eventually taking this to the IETF: since you are building on ECN,
and ECN is really not deployed in the Internet, you could use failure to
negotiate ECN as a trigger to turn this off.
Murari: My question is, should I write an I-D on this, and should I submit
it to TCPM or to ICCRG?
Spencer Shepler: For NFSv4 all of these problems exist, so getting this
documented in an I-D would at a minimum be useful.
Andrew: I would prefer to change the defaults in RED to this, because I feel
these are the only reasonable defaults.
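[Note: a minimal sketch of the sender reaction Murari described at slide 18,
cutting the window in proportion to the fraction of marked packets instead
of always halving. The gain value, variable names, and per-window update
structure are illustrative assumptions, not the Windows or Linux
implementation.]

    # Sketch of a DCTCP-style reaction to ECN marks (slide 18 idea).
    G = 1.0 / 16          # assumed EWMA gain for the marking estimate

    class DctcpStyleSender:
        def __init__(self, cwnd_pkts):
            self.cwnd = float(cwnd_pkts)
            self.alpha = 0.0          # running estimate of fraction marked

        def on_window_acked(self, acked_pkts, marked_pkts):
            # Fraction of packets in this window that carried ECN marks.
            frac = marked_pkts / acked_pkts if acked_pkts else 0.0
            # Smooth the estimate instead of reacting to a single mark.
            self.alpha = (1 - G) * self.alpha + G * frac
            if marked_pkts:
                # Cut in proportion to the marking fraction: a fully marked
                # window (alpha == 1) behaves like a classic halving.
                self.cwnd = max(1.0, self.cwnd * (1 - self.alpha / 2))
            else:
                self.cwnd += 1.0      # ordinary congestion-avoidance growth

    # Example: a window with 4 of 40 packets marked shrinks only slightly.
    s = DctcpStyleSender(cwnd_pkts=40)
    s.on_window_acked(acked_pkts=40, marked_pkts=4)
    print(round(s.cwnd, 2))

The proportional cut pairs with the switch configuration discussed above
(RED min and max thresholds equal, w_q set to one): marks then track the
instantaneous queue, so the fraction of marked packets indicates how far the
queue is over the threshold.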