TCP Maintenance and Minor Extensions (TCPM) Working Group
November 11, 2010 - IETF 79 - Beijing, China
=========================================================

Thanks to note-takers Ari Nyrhinen and Yi Ding!

WG Status - presented by the chairs (David Borman)
---------

Two WG documents are near publication (LCD and urgent data).

All information on the other documents was sent to the mailing list and is
in the slides; we won't spend much time on them in the meeting.

TCP Security
- In Anaheim, there was a separate session on this (6-8 people) that looked
  at the recommendations one by one and put together rules for consistent
  reviewing.
- Only 3 people came to the meeting on this in Beijing.
- Fernando will put together a revision structured like RFC 1122, so the
  recommendations will be clear and separate from the discussion (the goal
  is to have it by mid-December).
- The best way forward is to add all the pieces to the WG document.

Lars: When the long original document was brought to the working group, it
was clear that the WG did not want to publish it as-is. Yes, some local
participants might have made this disapproval sound larger than it was, but
I did not feel that there was even a small minority that wanted to accept it
as-is. Fernando is willing to keep editing the document, but I have some
doubts that we have the energy in the WG to finish this, and it's going to
lead to frustration for the author and the WG. I really wonder if you can
finish this document. (Fernando is not here - nobody sees him in the room.)

David: Fernando indicated he's willing to continue.

Lars Eggert: We've had this document for a year and there hasn't been a lot
of forward progress on it; I think you chairs need to watch it and decide
whether the WG has the energy to finish it.

Chairs: Either this works, or this is just not going to happen in this WG.

Mahesh Jethanandani: I'm glad you guys are doing this; I'll agree to review
the document.

Chairs: We're looking into having individuals review sections of the
document instead of the whole document; it's a huge document and a full
review would be too big a task.

Wesley: A reminder of what Matt said: if we don't do this document, someone
else will, and if we don't have expert reviewers carefully check each
recommendation, it may leave us unable to do good things later.

(Flipped through the rest of the WG items and other drafts, but didn't
discuss them in order to keep the meeting moving.)

A Testbed Study on IW10 vs IW3 - presented by Jerry Chu
------------------------------

This is the 3rd time supporting data has been presented at IETF meetings;
this time the data comes from a controlled testbed. A summer intern
(Yaogong Wang) produced the data; he is now back at school (North Carolina
State University), but continues to collaborate and has a web page up with
the results.

Fred Baker asked how IW10 performs on a very slow link (e.g. 56-64 kbps).
It's hard to get microscopic insight into flow behavior from the datacenter
measurements; the testbed tests can do better.

Poisson arrivals are used to emulate a browser (a rough sketch of such a
load generator appears after this setup summary). Linux's SACK
implementation is more advanced than most; results are also available with
SACK disabled. A number of bugs affect the results to some extent; many
knobs have to be tuned.

Lars: Have you only tested with Linux?
Jerry: Yes.

Murari Sridharan: We were hoping to have some results this time around; the
problem is that in Windows 7 the IW is not configurable, so we'd have to
give you a special version; we can give you a modified version.

(continuing with description of parameters)
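[Note: a minimal sketch of a Poisson-arrival load generator of the kind
mentioned above, emulating browser requests. The arrival rate, object
sizes, and server address are illustrative assumptions, not the testbed's
actual parameters.]

    # Sketch: Poisson-arrival request generator emulating a browser.
    # All parameters below are illustrative, not the testbed's values.
    import random
    import socket
    import time

    REQUESTS_PER_SEC = 2.0          # assumed mean arrival rate (lambda)
    OBJECT_SIZES_KB = [8, 16, 32, 64]  # assumed response sizes
    SERVER = ("192.0.2.1", 8080)    # placeholder address

    def run(duration_s=60):
        deadline = time.time() + duration_s
        while time.time() < deadline:
            # Poisson process: exponentially distributed inter-arrival gaps.
            time.sleep(random.expovariate(REQUESTS_PER_SEC))
            size_kb = random.choice(OBJECT_SIZES_KB)
            with socket.create_connection(SERVER) as s:
                s.sendall(("GET /%dk HTTP/1.0\r\n\r\n" % size_kb).encode())
                while s.recv(65536):   # read until the server closes
                    pass

    if __name__ == "__main__":
        run()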
Lars: The queue length used seems to be really huge.
Jerry: That number was given by Fred Baker.
Lars: What would happen if you set it to something like 10 or 5?
Jerry: If it is too small, performance is terrible.
Lars: Right, but here the delay is terrible.

(continuing with description of test cases)

Google initially wanted to optimize completion time; people had concerns
about background flows.

Lars: When you showed this the first time in Anaheim, the completion time
metric started when the GET was received. Has this changed now? I'm worried
that we lose the SYN-ACK exchange from the timing.

Scott Bradner: The last time the IW was increased, there was a lot of
research done to see how it affected dialup modems; some percentage of users
are still behind modems, so it's not a zero case. Have you figured out what
their buffering is?
Jerry: No, but we have results that show the completion time is not much
worse.

Lars: If you are using netperf, then it includes the SYN-ACK exchange, so
now it is there.

Jerry: No doubt IW=10 increases the packet loss rate; with extreme load it's
a lot higher, but regardless of the packet loss rate, IW=10 seems to improve
things, or at least not hurt. No serious fairness problem was detected
between IW=3 and IW=10 flows.

Ilpo Järvinen (back to slide 10): I wonder about the packet loss numbers;
have you figured out how much of the loss happened to the flow itself and
how much affected the other flows?
Jerry: Good question; no, we don't have that correlation.

Murari: Regarding SACK, back to Lars's point; I'm worried about the Linux
SACK. When SACK is enabled, we need to worry about the recovery of lost
retransmissions.
Jerry: We also have tests with SACK disabled.

(slide 11)

Idris Rai: So you observed an increase in loss rate, and at the same time
the queuing delay is very long. I think you should have a qlen of 20 packets
instead of 40 packets; if that were the case, would the loss rate be higher?
Would the completion time change? My main question is that your queuing
delay is very long; if you make it shorter, you'll observe more losses,
which would increase the completion time due to more timeouts. It's
difficult to draw a conclusion.
Jerry: I'm not saying IW10 always improves response time.

Ilpo: I can confirm that on these very low-bandwidth links, IW3 is already
so aggressive that it causes a lot of problems for itself. The additional
IW10 problems are hidden by the already serious problems of IW3; that's why
the numbers don't always seem to be as bad as we'd expect.

Andrew: Seeing these numbers reminds me that the traffic shaping code is
very sensitive to how the kernel is timeslicing. Can you tell me if you are
using a tickless kernel with a good high-resolution time source, or a
ticking kernel?
Jerry: We have two different testbeds; I don't remember if tickless and
high-resolution timers are enabled.
Andrew: I'm concerned about the really slow links; ticking kernels really
bite you because of the resolution.
Murari: Are you saying that by default the Linux kernel paces packets?
Jerry: The tickless timer resolution... (interrupted)
Murari: Given that the initial burst is not timer driven, is Linux pacing?
Andrew: I'm talking about the simulation of the 64 kbit/s link; the router
is my concern, not the clients.
Jerry: There is a known problem with the older kernel.
Andrew: It's more about the complete config and whether it works with
tickless; that's my main concern. It's more complicated than the kernel
version.

(Jerry and Andrew decide to talk offline to keep the meeting moving.)
(slides 12-13) IW=10 has a long tail; the figures show queue length
increases.

(slide 14) At high load, the IW10 advantage goes away because the higher
queuing delay eats the advantage.

Lars: I'm really disappointed that it is so hard to look at the data. What
are the colored bars on the left, for example?
Jerry: The numbers are a client-server pair.
Lars: Is it the average you are showing on the bars? What are the error bars
on those averages? What are the four bars?
Jerry: The left two are IW3, the right two are IW10; left is IW3, middle is
IW10, and the right is a mixture. Go to the website and you can see all the
combinations; I'm just listing the observations.

(slide 16)

Andrew: IW10 has 10 opportunities to lose a packet, IW3 has only 3;
therefore the median number of losses for IW10 is higher.
Jerry: I need to talk to you!

(slides 17 and 18) Background flows do better under IW10 than IW3. If you
have explanations for why this happens, that would be great. My explanation
is that IW10 is more efficient, so there are more gaps for other flows.

Lars: You said a 64 kbit/s bottleneck? Do you see the same for the faster
bottleneck?
Jerry: Check the website; I think it is much the same, but it does not show
any damage. Under very high load, the background flow sometimes stops
forever (timeout). This happens more under IW=10. SACK does help reduce UCT,
but only by a small percentage, for both IW=10 and IW=3.

Yoshifumi Nishida: Is it possible to get the raw data for the simulation? I
only see processed data.
Jerry: The tcpdumps, I'm not sure; I can ask my intern. If you volunteer to
process raw data, I'd be delighted. I'll try to make the tcpdumps public.

Effect of IW and Initial RTO Changes - presented by Ilpo Järvinen
------------------------------------

With a huge amount of concurrency, the benefit of IW=10 disappears. The
bottom line: both IW=3 and IW=10 are in trouble.

(slide 5) In the worst cases, IRTO=1 is able to help a bit, but it has the
opposite effect for the first starting burst.

(slides 6-8) These numbers really are real; the initial RTO of 1 adds
fairness because it changes the flows' positions slightly due to an
early-occurring RTO.

(slide 9) Two configs: RED based on the recommended parameters from the web
pages, and "REDok", aimed at more variable load and made aggressive enough
to affect slow start. REDok is very different from the defaults. The intent
with REDok was to keep operating within the RED range and not fall back to
FIFO behavior.

(slide 10) On the queue lengths: the steep, large-buffer lines peak at about
4 seconds in a low-load case. It seems that at the very high end IW=10 is
more aggressive, but for the rest it is less aggressive. The theory is that
IW=10 causes a lot of congestion for itself, so it needs to recover more;
IW=3 keeps flows going more constantly.

(slide 11) The yellow-vs-cyan lines show that at the higher end, REDok
cannot control the load IW=10 causes.

(slide 12) Red vs. green: IW=10 self-congestion makes it more bursty and
less aggressive.

(slide 13) The area between the cyan and yellow lines expresses the
unresponsiveness of IW=10. Packets arrive from "nowhere", so RED can do
nothing to control them until they are already in the queue.

(slides 14-15) Self-congested cases for IW=10 make it less aggressive than
IW=3.

(slide 16... questions?)

Lars (to both Jerry and Ilpo): I must admit I don't see a clear story yet, a
case for the conditions under which we can take this forward. We can see
cases where there is no downside, but there are also cases where it causes
problems or at least does not help.
There needs to be a story on where it helps.
Jerry: The difficulty here is that it is a tradeoff. Our numbers show that
with faster links the improvement is dominant; we hardly find any downside.
In extreme cases there are problems, and some people seem to think those
cases matter.
Ilpo: To me it seems there are clear extremes where we know what happens,
but then there is a middle ground where it is very hard to figure out a
clear boundary between where it helps and where it does not.
Jerry: There is real damage in terms of increased retransmissions; in TMRG,
some people think an increase of 1 or 2% is catastrophic. I don't think we
have a consensus on this.
Lars: One problem is that it is really hard to say even "OK, we can do this
over broadband", because the sender does not know whether the client is on
a broadband link. You have done a good job at gathering data. For the WG, we
need to decide where to put the bar; having more and more data is not really
helping us decide any better. That is what we should be talking about: at
what point are we willing to take this forward, and under which scenarios?
Jerry: The advertised receive window could be used for that problem.
Murari: But that's (at least in Windows) based on the interface speed.
Andrew: People with Linux routers are going to be changing receive windows
on the fly too (for highly congested slow links), like our friend from
Uganda.

Data Center TCP (DCTCP) - presented by Murari Sridharan
-----------------------

(slides 3-8)

Jerry: Jittering seemed to solve the problem?
Murari: All apps would need to change, but yes, jitter is a solution, though
a very app-specific one. The challenge is to get low latency for short flows
and high throughput for long ones. Buffers significantly increase the cost
and directly impact latency; shallow buffers can't absorb bursts.

(slide 14) RED is pretty much useless for these scenarios.
Ilpo: I would not put it like that. With the default parameters RED
certainly is useless, but maybe it could be configured to be more
aggressive.
Andrew: A shallow-buffered switch suffering from incast is actually running
out of buffers to do the enqueue.

(slide 18) Configure RED so that the max and min thresholds are the same, so
the minute you go beyond that, it marks. TCP also needs to cut proportional
to the losses.
Ilpo: I think you should also configure the RED w_q to be one.
Murari: Yes. Instead of always cutting the window by half, you cut it by
half of the marking probability. (A rough sketch of this reaction appears at
the end of these notes.)

(slides 19-21)

Jerry: I still don't quite understand how this solves the incast problem;
too many incast flows can still overflow the buffer.
Murari: That is true; the queue size is kept small and latency improves for
small flows.
Andrew: It works because there is now free buffer space.

(slide 27)

Murari: Commodity switches do ECN marking; the layer violation is already
done.

(end of slides)

Jerry: Have you ever tried delay-based algorithms?
Murari: At 10 Gbps there's so much noise in the delay measurement from other
sources (he gave an intermodulation example); the RTT and the processing are
on the same timescale.
Jerry: Do you have plans to move this forward?
Murari: The only confidence we have is that this works in a datacenter
environment, but we feel there is something here we should consider for the
Internet.
Andrew: Something like this could work on much slower timescales; I've seen
relatively slow wireless links that suffer from incast.
Lars: On eventually taking this to the IETF: since you are building on ECN,
and ECN is really not deployed in the Internet, you could use failure to
negotiate ECN as a trigger to turn this off.
Murari: My question is, should I write an I-D on this, and should I submit
it to TCPM or to ICCRG?
Spencer Shepler: For NFSv4 all of these problems exist, so getting this
documented in an I-D would at a minimum be useful.
Andrew: I would prefer to change the defaults in RED to this, because I feel
these are the only reasonable defaults.
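[Note: a minimal sketch of the sender reaction Murari described at slide 18,
cutting the window in proportion to the fraction of marked packets instead
of always halving. The gain value, variable names, and per-window update
structure are illustrative assumptions, not the Windows or Linux
implementation.]

    # Sketch of a DCTCP-style reaction to ECN marks (slide 18 idea).
    G = 1.0 / 16          # assumed EWMA gain for the marking estimate

    class DctcpStyleSender:
        def __init__(self, cwnd_pkts):
            self.cwnd = float(cwnd_pkts)
            self.alpha = 0.0          # running estimate of fraction marked

        def on_window_acked(self, acked_pkts, marked_pkts):
            # Fraction of packets in this window that carried ECN marks.
            frac = marked_pkts / acked_pkts if acked_pkts else 0.0
            # Smooth the estimate instead of reacting to a single mark.
            self.alpha = (1 - G) * self.alpha + G * frac
            if marked_pkts:
                # Cut in proportion to the marking fraction: a fully marked
                # window (alpha == 1) behaves like a classic halving.
                self.cwnd = max(1.0, self.cwnd * (1 - self.alpha / 2))
            else:
                self.cwnd += 1.0      # ordinary congestion-avoidance growth

    # Example: a window with 4 of 40 packets marked shrinks only slightly.
    s = DctcpStyleSender(cwnd_pkts=40)
    s.on_window_acked(acked_pkts=40, marked_pkts=4)
    print(round(s.cwnd, 2))

The proportional cut pairs with the switch configuration discussed above
(RED min and max thresholds equal, w_q set to one): marks then track the
instantaneous queue, so the fraction of marked packets indicates how far the
queue is over the threshold.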