=============================================
LEDBAT WG Meeting
IETF 75, Stockholm, Sweden
Wednesday, July 29, 2009, 13:00 - 15:00, Congresshall B
Chairs: S. Shalunov, M. Sridharan, B. Aboba (acting)

Preliminaries (10 minutes)
  Note Well
  Blue Sheets
  Minute Takers (Matthew J. Zekauskas, Al Morton)
  Jabber Scribe
  Agenda bashing

Agenda Bash
Iljitsch van Beijnum would like to get feedback from folks here on the
appropriate size of buffers in home gateways. See homegate@ietf.org; a tsvarea
discussion is ongoing.

Document Status
No current WG documents. Three documents are under consideration for adoption
as WG work items at this meeting.

Documents Under Consideration for Adoption as a WG Work Item (80 minutes)

13:10 - 13:40 Low Extra Delay Background Transport (LEDBAT), S. Shalunov (30 minutes)
http://tools.ietf.org/id/draft-shalunov-ledbat-congestion

The document hasn't changed since last time, but there was discussion on the
mailing list that we want to address.

First, what are the goals? To saturate the network, but keep delay low.
Yielding to TCP and adding little extra delay are consequences of this, but we
state them separately.

Stas gives an overview of the congestion control goals and some features of the
pseudo-code. The receiver just keeps telling the sender the measured delay. The
sender is more complex: it keeps delay low by measuring and reacting, using the
notion of a target delay. The controller is a proportional-integral-derivative
(PID) controller. As a safety measure, the window is halved on loss.

Question: What is the framing? UDP? TCP?
Answer: The congestion control mechanism is independent of framing. An early
draft of the charter specified UDP framing; it is not in the current charter.
UDP is the most expedient way to deploy LEDBAT, but it could work over TCP as a
modification (though timestamps would be required).

Question: Can it work on top of the TCP control loop?
Answer: Don't know; it doesn't seem like a direct application, but that is not
part of the WG charter. It could be a DCCP CCID, or an SCTP modification/option
(timestamps required).

Lars: We can discuss framing after the algorithm has stabilized and people have
played with it (the document is slated for Experimental). Once we have
implementation experience, we will take it from there.

Latecomer's advantage (or first mover's advantage)
This is a general problem with delay-based congestion control, discussed with
respect to TCP Vegas and FAST. A latecomer could be deceived by the standing
queue and think the base delay is higher than it really is. The latecomer can't
look inside the queue and see packets; it only measures the delay. If you have
a stack of latecomers, they underestimate the queue size and could starve out
the first movers. But a sender that is being starved sees a larger estimated
queue, so it stops putting traffic on the bottleneck; the queue then drains,
and the latecomers get a better baseline measurement. As a result, the
latecomers' queue-size estimates increase and they decrease their sending rate.
So we think the late-mover advantage will be short-lived.

Such an advantage can be observed in a simulator, but not in the wild, because
in a simulator it is easy to get phase-locked packets. In real networks you
have more jitter, such as from disk seeks, which introduces randomness. More
randomness increases the probability of extreme conditions (e.g., a minimum
estimate of baseline delay). Perhaps the answer would be different for a huge
number of flows, i.e., a higher degree of statistical multiplexing, but with a
few hundred flows, dips will show up regularly.
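As a rough, non-normative illustration of the sender behavior described above,
here is a minimal Python sketch. The class and method names, the constants, and
the assumption that a one-way-delay sample arrives with every ACK are choices
made for the example; it implements only a simple proportional reaction toward
the target, not the full PID controller from the draft.

```python
# Minimal, illustrative sketch of the LEDBAT sender behavior described in the
# minutes above. Names, structure, and constants are assumptions for the
# example, not the draft's normative pseudo-code; only the proportional part
# of the controller is shown.

TARGET = 0.025   # target queuing delay in seconds (25 ms, per the discussion)
GAIN = 1.0       # window change of GAIN MSS per RTT when the queue is empty

class LedbatSender:
    def __init__(self):
        self.cwnd = 1.0                  # congestion window, in MSS
        self.base_delay = float("inf")   # lowest one-way delay seen so far

    def on_ack(self, measured_delay):
        # The receiver just reports the measured one-way delay.
        self.base_delay = min(self.base_delay, measured_delay)
        queuing_delay = measured_delay - self.base_delay
        # React toward the target: grow below it, shrink above it.
        off_target = (TARGET - queuing_delay) / TARGET
        self.cwnd = max(1.0, self.cwnd + GAIN * off_target / self.cwnd)

    def on_loss(self):
        # Safety measure from the presentation: on loss, halve the window.
        self.cwnd = max(1.0, self.cwnd / 2.0)
```

The base_delay minimum also makes the latecomer discussion concrete: a flow
that starts while the standing queue never drains records a base_delay that is
too high, and therefore underestimates the queuing delay it is contributing to.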
Rich Woundy, Comcast: Any measurements to confirm this behavior?
Stas: Fairness seems to be preserved. There are no studies showing measurements
in the wild.
Lars: The question asked for Experimental status is: "Will this be safe on the
Internet?" The longer-term question is whether this will be effective in
reducing queue size.
Rich's second question: Any hardware dependencies? If we move to solid-state
storage instead of hard drives, so that seek times decrease, do the dips go
away?
Stas: Operating system schedulers still introduce jitter in user space; there
is no way around that.
Matthew J. Zekauskas: You can see this kind of rate dip in Internet traffic
today.
Stas: You select the target so that it works without multiplying by the number
of flows. You don't multiply the target by the number of flows; all flows
target the same delay, whether there is one flow or fifteen.
Rich again: Do all apps have to use the same queuing delay target? If they pick
different queuing delay targets, does that work?

Fairness by random redistribution
The current draft does not contain this discussion; this will be fixed. How do
multiple flows end up sharing the queue, and why? As a thought experiment:
suppose we want, as a design goal, to redistribute a bit of capacity between
connections randomly, while keeping the same total target link capacity. It is
intuitively clear, and can easily be shown rigorously, that this can be
accomplished by introducing randomness into the measurement. Instead of using
the measured delay, add a random quantity to it, which doesn't have to be
large: whatever we need to redistribute, say 10 percent of the RTT. The error
needs to be a fraction of the target related to the number of packets per RTT
divided by ten in that case, which is a very small fraction. If there are few
packets in an RTT and the target is 25 ms of queuing delay, we are talking
about sub-millisecond errors. So the question is whether that randomness is
already there, such as in the phase of arrival or serialization with respect to
the previous packet. This is the same randomness that causes dips to appear
(and evens out the estimates of base delay).
Lars: Related work: RFC 5148 (MANET) talks about jittering timers in routing
protocols for a different purpose, but the effect is the same.

Parameter Values
The choice of parameter values is a little more interesting. There are two
parameters here: GAIN and TARGET. All values on the slide are "not insane"; for
various definitions of "work", all of them work, and we can go further beyond
these ranges. But how do we choose? The choice of GAIN is more arbitrary than
the choice of TARGET.

GAIN controls how fast you converge and how stable the controller is: a large
GAIN gives fast convergence, a small GAIN gives stability. Both 1 MSS/RTT and
10 MSS/RTT are conservative. The value 1 has an interesting property: if delay
measurement is completely broken and zero queuing delay is always measured, the
controller replicates the ramp-up of a single TCP flow. That is why 1 was
chosen originally: we know that if we ramp up as a single TCP flow does, the
world doesn't end.

TARGET is less arbitrary. There is a wide range of potential values that are
all sane; the numbers in the draft work, but they aren't magic. Higher target
values are more robust: you get a smaller relative error for any given absolute
error. Lower target values add less delay, but there are clearly diminishing
returns: going from 1 s to 0.5 s of queuing delay is a huge difference for
interactive users, while going from 2 ms to 1 ms makes no difference whatsoever
(1 ms is way too low and is not a value that would work; it is just an
example).
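As a back-of-the-envelope check of the GAIN = 1 MSS/RTT claim above, reusing
the illustrative per-ACK update from the earlier sketch (an assumption for the
example, not the draft's exact rule): if the measured queuing delay is always
zero, the normalized offset is 1, each ACK adds GAIN/cwnd MSS, and the roughly
cwnd ACKs that arrive in one RTT add about GAIN MSS in total, i.e., the same
1 MSS per RTT linear increase as a single TCP flow in congestion avoidance.

```python
# Illustrative check of the GAIN = 1 MSS/RTT property, with the measured
# queuing delay forced to zero (delay measurement "completely broken").
cwnd = 10.0                      # window in MSS; arbitrary example value
GAIN = 1.0
for _ in range(int(cwnd)):       # roughly one ACK per segment in flight = one RTT
    off_target = 1.0             # (TARGET - 0) / TARGET
    cwnd += GAIN * off_target / cwnd
print(round(cwnd, 2))            # ~10.96, about GAIN (= 1) MSS of growth per RTT
```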
Human reaction/perception thresholds seem like a useful reference for choosing
the target. Adding or subtracting it from a typical RTT doesn't do much of a
disservice: unless you are a pro gamer, you probably can't notice a difference
of 25 ms in ping time. Why is 25 ms there? Could we make it 10 or 50 ms? Yes.
Ted Hardie: Thank you for not having magic numbers in your head. For human
perception in applications, it is rarely a single flow that is taken into
account before presenting to the end user. That is true for large video flows,
but for other applications a combination of multiple things typically has to
complete before UI actions appear. In extreme cases, hundreds of different web
sites can be required to render a single web page.
Stas: Another useful benchmark, not on the slide, is the speed-of-light limit
on RTT, which is not a very large fraction of Internet delay.
Lars: I want to disagree with Ted a bit. If you are directly connected to a
server and add this, making that server look 25 ms away, that is OK. 100 ms
versus 125 ms is probably OK. We are not multiplying by the number of flows.
Sean Doran: Let me pile on with Ted. Are the magic numbers for humans? For
audio, less than 70 ms is useless.
Ted: I disagree with Lars. In the case of a web page, you must go and fetch the
links to render the page, and that might involve redirects to other sites. Each
additional 25 ms of delay can start to add up to a long rendering time, which
is perceptible by the user. We could be talking about 30 RTTs, which is not
small.
Lars: Consider that if we are in TCP slow start, each of those RTTs might be
longer. If we move the server 25 ms further away from you, in many use cases it
will not matter, but there are some cases where it does: NFS. Think of this as
adding 3000 km of physical distance. For some applications that is a big deal.
Bruce Lowekamp: On multiplying the target by the number of flows: does that
assume there is a single bottleneck for all flows, or that all flows have the
same destination?
Stas: We are not assuming that. A single bottleneck is the typical case, but if
there isn't one, there is no multiplicative effect.

POLL: Adopt this as a WG work item - 18 for adoption, none against. Consensus
to be verified on the list.

13:40 - 14:10 LEDBAT Practices and Recommendations, R. Penno (30 minutes)
http://tools.ietf.org/id/draft-penno-ledbat-app-practices-recommendations

R. Penno is present in Stockholm but is ill and so cannot present. We will go
over the slides from IETF 74 again. The document has not been revised since
IETF 74.

Stuart Cheshire: It is not clear that multiple parallel connections result in
higher throughput. I think that is a common fallacy, particularly on lossy
networks: if there is not enough data to trigger fast retransmit, the whole
thing slows down.
Stas: As a general question, I'd like to hear the WG's opinion on how much this
draft should describe practices and how much it should make recommendations. In
other words, do we need a better survey, or better guidance?
Lars: When we wrote the charter, we wanted recommendations. We saw confusion
out there; Stuart just illustrated one such issue. So having the IETF comment
on this would be useful.
Stas: Personally I think the document doesn't contain enough recommendations;
it just collects evidence. We still have to work on recommendations.
Bob Briscoe: To come back to Stuart's point: 1. Educate and explain.
2. Recommendations may be difficult.
Stuart: Multiple connections get bad performance if each only has a few
packets; also, if they are not using full-size packets, then pack the data into
one stream instead.
Bob Briscoe: Oh, I misheard. I agree!
Stas: If one connection gets all the throughput you should be getting, more
connections will not help you.
Stuart: More connections do help when the parallel connections get a bigger
share of a congested link, but my sense is that those cases are fairly rare.
When Mosaic did it, we only had HTTP 1.0 and could not pipeline GETs. Nowadays
you can send 16 GET requests over a single TCP connection, which is much faster
than opening 16 connections, each with one GET.
Michael Welzl: I made this point in ICCRG. MulTFRC gave an upper limit of 6,
based on some data from an INFOCOM paper. I think it could be used for some of
these; it gives a rough view.
Stas: Les Cottrell from SLAC did some measurements, where the connections were
mostly window-limited. It was very clear that there is a dropoff past a certain
point: you start losing performance even when trying to saturate the link with
connections. Get past 10-20 connections and performance goes down. These
results are somewhat similar.
Bob Briscoe: This depends on whether you have a shared link or a self-congested
link. If self-congested, it won't help; if shared, it may.
Stuart: One quick final comment, given that we are designing protocols that
everyone uses: if one person is greedy, they get a bigger share. If everyone is
greedy, we're back to square one, but we're more inefficient.
Stas: It is clear in the document that trying to be greedy gets more out of a
congested bottleneck, but that is not a good reason to use multiple
connections. The only classic product that opens multiple connections to get a
bigger share is the download manager.
Bob Briscoe: We can't tell intent at build time; we don't know whether we have
a shared link or a self-congested link. No matter what the intent is, it might
get misused in another situation. To reinforce Stuart's point: when we start an
arms race, TCP squares the amount of congestion. If everyone puts in 2x more,
we get 4 times the congestion; 10x more, 100 times, and then "congestive
collapse".
Richard Woundy, Comcast: I have a more fundamental question about this
document. Is it under active authorship? Are the authors continuing? They
didn't update the document, and they aren't here to present it. There is
clearly a lot of work to do.
Stas: I don't have an update on the status. If it turns out that we need more
contributors or another editor, that is easier to do if it is a WG document
under IETF change control. The WG can't find an editor for a non-WG document.
Lars: Since there is no WG work item yet in this area, if someone wrote a
document that the WG liked better, that would move forward. If it is a WG work
item, we could talk about replacing the editor.
Stas: We did talk to the editor, and he wanted it to become a WG work item. He
is in Stockholm, but has fallen ill.
Lars: That explains why he is not presenting, but not why there is no update. I
hate to step on somebody's toes, but I'm hesitant if there is a lack of editing
cycles and the document is not a WG work item. This is a situation where we
frequently run into problems.
Bernard: Who is interested in working on this document and contributing to it?
Two people raise their hands (Vijay and Richard Woundy).
Stas: This shall not go unpunished. Who thinks this document should be accepted
as a WG work item? Zero hands. Who thinks it should not be accepted? Two hands.
Vijay: Since IETF 74 there has been no update, and it needed more work.
Lars: Let's get a sense of how to move forward. If the document were to be
updated a few times, would people think it could be ready, or will the document
never be ready? Who thinks it would be ready if worked on some more? 11 people
raise their hands.
Who thinks that no matter how much effort goes into it, something is
fundamentally wrong? Zero hands raised. Clear message: get an active editor and
put in cycles.

14:10 - 14:30 A Survey of Lower-than-Best Effort Transport Protocols, M. Welzl (20 minutes)
http://tools.ietf.org/html/draft-welzl-ledbat-survey

This document is a literature review, to help us avoid reinventing the wheel.
It looks at delay-based congestion control algorithms as well as
application-layer mechanisms.

Delay-based algorithms: TCP Vegas was not designed to be lower than best effort
(LBE), but it is a nice example: it is LBE in the presence of Reno, though it
does better if it is the only one. Others are based on Vegas but designed for
LBE: TCP Nice and TCP-LP.

Non-delay-based algorithms: These are also designed to give way, though they
grow less aggressively than TCP. 4CP uses a virtual window to limit the
congestion window earlier. MulTFRC with a weight of 0.1 is 10x less aggressive,
but it requires queue growth and reacts only to losses.

Application-layer approaches: So far, receive-window tuning. There are some
quite sophisticated ones; see a SIGMETRICS 2004 paper. The claim is that this
could work as well as a transport-layer scheme.

Bernard: This document is under consideration for adoption as a WG work item.
Who wants to see it adopted?
Lars: This would be published as Informational.
18 hands. Who doesn't want to see it adopted? No hands. Consensus to be
verified on the list.

====
5-minute item - HOMEGATE, Iljitsch van Beijnum

There was a bar BOF Monday on home gateways: HOMEGATE. There is no charter; all
is up in the air. How much buffering should be in home gateways? Iljitsch wants
people to send their information and thoughts about buffering and queueing
strategies. Join homegate@ietf.org or send feedback to Iljitsch:
www.ietf.org/mailman/listinfo/homegate

Stas: Don't do things in HOMEGATE that make LEDBAT's job harder. A very small
buffer would make congestion control more difficult. I prefer the "do no harm"
criterion for recommendations.
Bob Briscoe: Weighted Fair Queuing isn't designed for home gateways -- don't
isolate the apps from each other.
Mark Handley: Very small buffers on GigE links with high statistical
multiplexing.
Richard Woundy: 10 packets may not be enough buffer; 150 ms to 200 ms worth,
fewer than 100 but more than 10 packets (see the rough sizing sketch at the end
of these minutes).
Dave Oran: The box that controls L2 state (e.g., 802.11 power-save buffering)
has no interaction with the device's L3 queuing. Think of a cable modem:
"Comcast, it's our fault." Back to Richard's comment: you sometimes need a
longer queue.

Next Steps and Wrapup (30 minutes)
14:30 - 15:00 Chairs & Area Directors (30 minutes)
Nothing to discuss - all votes were unanimous.
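For context on the buffer figures raised in the HOMEGATE discussion, here is a
rough bandwidth-delay sizing sketch. The 10 Mbit/s uplink rate and 1500-byte
packet size are assumed example values, not numbers given at the meeting.

```python
# Rough buffer sizing for a home-gateway uplink (assumed example values).
link_rate_bps = 10_000_000       # 10 Mbit/s uplink -- assumption for the example
packet_bytes = 1500              # full-size Ethernet frame
for delay_ms in (10, 150, 200):
    buffer_bytes = link_rate_bps / 8 * delay_ms / 1000
    print(f"{delay_ms} ms of buffering ~= {buffer_bytes / packet_bytes:.0f} packets")
# Prints roughly 8, 125, and 167 packets respectively.
```

Since the packet counts scale directly with the link rate, the time-based and
packet-based figures quoted in the discussion line up only for particular
uplink speeds.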