Minutes of the ICE Working Group at IETF 94

The ICE working group met at IETF #94 in Yokohama, Japan on Thursday November 5th, 2015 from 9:00 to 11:30.

The meeting was chaired by Ari Keränen and Peter Thatcher.

Taylor Brandstetter, Magnus Westerlund, and the chairs took notes and Jonathan Lennox acted as Jabber relay.

The meeting was broadcast live and recorded by the Meetecho team. The recording of the session is available at the following URL:

http://recs.conf.meetecho.com/Playout/watch.jsp?recording=IETF94_ICE&chapter=chapter_1

Below is the final agenda with links to the relevant sub-sections:

09:00-09:30 Introduction and Status Update (30 mins, Chairs)

09:30-10:00 Trickle ICE (30 mins, Justin Uberti)

10:00-10:45 Timing of STUN messages (Ta) (45 mins, Chairs, Pål-Erik Martinsen)

10:45-11:15 ICEbis (30 mins, Ari Keränen)

11:15-11:30 Future improvements to ICE (15 mins)

Chairs' Introduction and Status Update

The chairs presented the ICE working group status and background.

Status of WG

Peter Thatcher presented an introduction to ICE.

The dual stack fairness draft has already been through WGLC. The latest version is in good shape, but the chairs would like to get more reviews. Bernard Aboba volunteered to review the draft.

Question: How do we handle issues and documents? GitHub is working well for other groups, so this is what Ari/Peter suggest. +1 from EKR.

Cullen Jennings: GitHub is great for the people involved, but it makes things more confusing for observers.

Bernard Aboba: GitHub activity could be broadcast to the mailing list. One way is to register the mailing list as a contributor.

Barry Leiba: HTTPbis is currently doing this, using GitHub for discussions and issue tracking, with a separate mailing list for the GitHub emails.

Christer Holmberg: When I want to create my own draft, do I need to create my own GitHub repo?

Peter Thatcher: It's easy to add repositories, so there's no problem with asking a chair to add an additional one.

Ari Keränen: As a general approach, we should have all documents located on GitHub.

EKR: There is a lack of clarity about the workflow. Do we have the main discussions on the mailing list and use GitHub for issues and minor discussions, as we do for rtcweb? Or have everything on GitHub, like HTTPbis does?

Ben Campbell: As Cullen said earlier, the more the workflow is dedicated to GitHub, the harder it is for non-high-velocity people to know what's going on. We should expect transport people to be in this category.

Martin Thomson: I suggest that the rtcweb approach is the better fit (as opposed to HTTPbis). Discoverability is an issue for some people; this could be solved by including a paragraph that says "this is the copy we're working on". We should establish a chain of links between the IETF site and GitHub so that people can go from the published version to the working version to the issues list.

Cullen: Because we're using the GitHub issue tracker to track things, we should be able to subscribe to one repository rather than one for every new draft. So I suggest having one repository for all drafts. It's useful (REQUIRED, in case GitHub one day goes away) to have the mailing list as an archive, but it's not very useful for figuring out what's going on.

Martin: Having multiple drafts in one repository is fine, but it means the author of any draft can push any update. But I think we can manage that.

EKR: This also means that if you look at the issues, you'll see them all in one bucket. So either we need multiple repositories, or we need to label every issue diligently.

Cullen: But I think people having the energy to discuss things is more important than making things easy for the editors. Technical discussion is happening on GitHub anyway.

Harald Alvestrand: GitHub's tools are anemic when it comes to dealing with anything larger than one repository. That said, there are lots of reasons why things get difficult when you have multiple things in one repository. Another point: sometimes it has been difficult to get people to move discussions from GitHub to the mailing list.

Conclusion: GitHub seems like a good idea, but we need to work out the details of the workflow, especially for having discussions in the right place. We will go forward and figure out the details along the way.

Trickle ICE (30 mins, Justin Uberti)

Justin presented his slides.

Can we unfreeze all checklists based on foundation only? Currently, if we receive a useful candidate for a second checklist but not for the first, we can't unfreeze it.

Jonathan Lennox: If you didn't do this, what does this do to the relative check times? I think it still makes sense to do everything on the first component first. Once something succeeds on one component you unfreeze the others.

Justin Uberti: But if for some reason the first component is delayed, it delays things.

Jonathan: But that only happens if you have a lost packet or something.

Justin: Yeah, it's edge-casey. I'll open an issue and we'll discuss it there.

EKR: Is the issue that we get some candidate, but it's useless, but now because you've already decided on the first component you're checking, candidates for other components aren't used? If so I agree this is edge-casey.

Justin: Then let's agree we'll punt on this and keep the existing rules for freezing.

Conclusion: keep the existing rules for freezing.
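For reference, a minimal sketch of the foundation-based unfreezing rule being kept (a rough Python illustration; the data structures and state names are ours, not the draft's exact text):

    from dataclasses import dataclass

    @dataclass
    class Pair:
        foundation: str
        state: str = "FROZEN"  # FROZEN / WAITING / IN-PROGRESS / SUCCEEDED / FAILED

    def on_pair_succeeded(succeeded: Pair, checklists: list) -> None:
        # When a pair succeeds, frozen pairs that share its foundation move
        # to WAITING across all checklists (the existing RFC 5245 behavior).
        for checklist in checklists:       # each checklist is a list of Pairs
            for pair in checklist:
                if pair.state == "FROZEN" and pair.foundation == succeeded.foundation:
                    pair.state = "WAITING"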

If one component doesn't have the same foundation as another component, the unfreezing logic doesn't work.

Jonathan Lennox commented that the trickle rules could enforce order, by providing candidates for components in order and maintaining that order.

Emil Ivov: We shouldn't design at the mic. We already agreed that we trickle candidates in the order they'd be signaled, but if we want to change that, I'd be open.

Jonathan: Trickling earlier doesn't gain anything if those pairs stay frozen. It just changes where the implementation burden is.

Emil: Adding new states for components would make things complicated, so I'd prefer we keep the order.

Peter (as individual): I think it's more burden on the sender side and less on the receiver side. It would also slow things down if signaling is slow. Plus, in WebRTC, we already have to deal with JavaScript handing down remote candidates in the wrong order.

Jonathan: We need signaling that means there are no candidates.

Justin: How do we detect this?

Jonathan: TURN allocation timed out, etc.

Peter: Can we change the freezing algorithm to be more lax?

Conclusion: Take to the mailing list the question of what to do about unfreezing when one component doesn't have the same foundation as another. Signaling "no candidates for this component" is problematic. Delaying the signaling of other candidates is problematic. Still looking for a solution.

Do we need a different procedure for checking duplicate candidates? In Vanilla ICE the sender side filters out duplicates, but with trickle ICE we have no way of knowing if a duplicate will come later.

EKR: We currently just throw away the duplicate.

Jonathan: This situation is likely to happen when you discover that something prflx is srflx. Also, this applies to aggressive nomination since both sides need to agree on the highest priority pair.

Justin: We could also have a host candidate and then get a dup srflx, but that's a theoretical problem. I'm unclear about the consensus; we either say first one wins, or replace lower priority with higher priority; will take this discussion to the list.

EKR: A third option could be to not prune the duplicate candidates.

Justin: But that means you'll generate checks for both, and we don't want that.

EKR: This situation only happens if there's an implementation error on the other side, so we should take the easier option.

Justin: First one wins is easier.

Conclusion: As suggested earlier, will discuss the two options for handling duplicate candidates (first one wins or replace with higher priority) on the list.
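To make the two options concrete, a minimal sketch (hypothetical names; only the branch taken differs between the options):

    from dataclasses import dataclass

    @dataclass
    class Candidate:
        addr: str      # transport address, e.g. "192.0.2.1:3478"
        priority: int

    def add_remote_candidate(existing: list, new: Candidate,
                             replace_with_higher: bool) -> None:
        dup = next((c for c in existing if c.addr == new.addr), None)
        if dup is None:
            existing.append(new)
        elif replace_with_higher and new.priority > dup.priority:
            existing[existing.index(dup)] = new  # option 2: higher priority wins
        # else: option 1, "first one wins" -- the duplicate is dropped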

Proposing to add a "waiting-for-candidates" state for when the checklist is empty and no candidate pairs have been sent or received.

Ari (individual): Seems like an editorial change, so seems reasonable.

Justin: I just don't think trickle ICE should define new ICE states.

Jonathan: I don't see why not since it's already defining new ICE behavior.

Conclusion: We'll go ahead and add this state to the Trickle ICE draft.
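As a sketch, the proposed state would sit alongside the RFC 5245 checklist states roughly like this (the exact name and wording are up to the draft):

    from enum import Enum

    class ChecklistState(Enum):
        WAITING_FOR_CANDIDATES = 1  # proposed: checklist empty, nothing sent or received
        RUNNING = 2                 # existing RFC 5245 states from here down
        COMPLETED = 3
        FAILED = 4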

Ari: ICE restart needs to be clarified, but this will be discussed in the ICE bis slot.

Justin suggests that we remove the SDP material from Trickle ICE and put it in a separate document, as was done with ICEbis. It will end up either in an SDP-for-ICE or an SDP-for-Trickle-ICE document in MMUSIC.

Jonathan: SDP for Trickle ICE needs to be worked on in the same WG as SDP for ICE. It is also relevant for JSEP.

Justin: The SDP could also go into the ICE SDP draft. Either way, it will be done in MMUSIC.

Conclusion: Agreed that continuous nomination is out of scope for Trickle ICE (three thumbs up in audience).

Timing of STUN messages (Ta) (45 mins, Chairs, Pål-Erik Martinsen)

Ari presented an introduction to the timing issue. Pål-Erik presented his proposal for fixing the timers. Peter presented measurement results.

When doing connectivity checks, STUN messages are paced every Ta milliseconds. For non-RTP traffic, 500ms resulted in poor performance, so most implementations ignore the "MUST". For RTP, 20ms was OK. The concern was that if we send pings too fast we'll overload the NATs and links.
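For context, the pacing rule in question amounts to the following sketch (the Ta value shown is illustrative; the normative minimum is exactly what is being debated):

    import time

    TA_SECONDS = 0.020  # 20ms, the value that "was OK" for RTP

    def run_connectivity_checks(pending, send_check):
        # Start at most one new STUN transaction every Ta.
        for check in pending:
            send_check(check)
            time.sleep(TA_SECONDS)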

Emil: Some people were concerned that if you created mappings too quickly you would crash the NAT, but this only happened with one NAT, so it's mainly a bandwidth issue. But I don't think we should specify the time in this spec, particularly for bandwidth use, because it may become obsolete, as 500ms has. We should just caution about the risks.

Ari (individual): We already have the ability to negotiate the timer value; we just want a normative lowest value.

EKR: This matters for three reasons: not overloading the NAT, congestion control, and security (an ICE stack being used for DDoSing). The latter two reasons are why this is really needed. The second reason applies to both SIP and WebRTC.

Emil: We should address congestion, but if we say 10ms, it may become obsolete in 2 years.

Cullen: When these numbers were picked, people thought they were ridiculous, but one reason they were picked is so they can work on satellite links, and that hasn't changed much.

Justin: We want this to work not only on 4G/5G but also on slower networks, so what we should come up with is a maximum bandwidth, plus a recommendation for what works in most situations.

Magnus Westerlund: Problem is twofold: networks are getting faster, but some are staying just as slow. In the worst case you have no information about the path so you need to err on the safe side.

Peter: We collected data and found that although successive binding requests have a higher chance of failing, decreasing Ta surprisingly increases the success rate.

Martin: This is counter-intuitive, do you have any hypothesis?

Peter: Nope.

Cullen: So whatever we think we know that led us to pick 20ms is probably wrong. We should try to replicate the results on a single link and a single NAT (Cullen volunteered).

Action item: Get the absolute success rate of the first batch (graph only shows second/third batches relative to first).

Justin: What we're suspecting is that the firewall is the determining factor.

EKR: We need to understand what is happening here.

Cullen: Is there any chance there's a load balancer sitting in front of this thing?

Peter: We've only tried Google STUN servers, we could try others.

Magnus: Firewall seems plausible.

Jonathan: Was this on mobile or desktop? Could be some funny behavior on the wireless link.

Action item: Slice data between desktop and mobile.

Martin: It's good that we started measuring but we need to know specifically what we're measuring.

Peter: We tried lowering Ta for a small sample of real users, with no issues so far.

Magnus: Worried about experiment bias. The use case (Chrome/WebRTC) could affect the results.

Peter: Note that the initial data shown isn't specific to WebRTC, just Chrome. Only the real-user experiment is WebRTC-specific.

Randell Jesup: Also concerned about bias. The data may be heavily biased towards people with a lot of available bandwidth.

Justin: That could be true. But this data is more to see if there's an issue besides bandwidth (namely NAT bindings) that affects connectivity check success rate. We just need to be careful about what we use this data for.

Chairs: We need more experiments and a better understanding of what is being measured.

Should we have timer values in a separate draft?

EKR: Nah. Let's burn that bridge when we come to it (ICEbis last call).

Ari: The timers are the main remaining open issue in ICEbis though.

Cullen/Magnus: We won't be able to get this published without deciding on values.

Ari: So should we publish with more conservative values, or wait until we figure out more appropriate values? We should ask how urgent finishing ICEbis is.

3 people think it's important to have it done by next summer. 3 people want to keep bashing.

Conclusion: it’s OK to keep working on the timer values until the summer.

ICEbis (30 mins, Ari Keränen)

Ari presented his slides.

Pål-Erik Martinsen: The ICE pacing is not always implemented as stated. Is it important that binding requests are paced exactly Ta apart?

Emil: We're not pacing STUN messages, but STUN transactions.

EKR: Firefox paces STUN transactions but has a throttle for all STUN messages based on bits per second.

Martin: We have two token buckets, one allowing short bursts and a second one with lower bandwidth.

Ari: In a browser environment you could have multiple apps doing ICE at the same time which is why you need the absolute throttle.

Martin: Even with one ICE session you could do things like have a large ufrag, or add STUN attributes. I suggest normative text for a maximum rate.

Ari: Concern is picking the number, we tend to spend a lot of time on that.

Cullen: In browsers one definitely needs a global congestion control that limits the total bit-rate generated by ICE checks. This applies to everyone that can have multiple simultaneous ICE checklists.

Justin: I don't think we can give a single number because different use cases have different available bandwidth.

Jonathan: The available bandwidth on your different paths is also going to be different. It may not make sense to have the same Ta value for different interfaces. In a browser a single max value may make sense, but not in other situations. Also could be a privacy issue (detecting if an origin is using ICE).

Cullen: There are other issues that are much worse, and there are no known solutions.

Martin: I disagree with the premises of the argument but agree with the conclusion: we don't need to do anything.

Conclusion: Need to nail down the lower bound for Ta, the maximum bandwidth, and the recommended check interval. Taking this to the mailing list.
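A minimal sketch of the dual-token-bucket throttle Martin described earlier (one bucket allows short bursts, the other enforces a lower long-term rate); all rates here are made-up placeholders, not proposed values:

    import time

    class TokenBucket:
        def __init__(self, rate_bytes_per_sec: float, burst_bytes: float):
            self.rate = rate_bytes_per_sec
            self.capacity = burst_bytes
            self.tokens = burst_bytes
            self.last = time.monotonic()

        def has(self, n: int) -> bool:
            # Refill based on elapsed time, capped at the bucket capacity.
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            return self.tokens >= n

        def take(self, n: int) -> None:
            self.tokens -= n

    burst = TokenBucket(rate_bytes_per_sec=25_000, burst_bytes=5_000)    # short bursts
    sustained = TokenBucket(rate_bytes_per_sec=5_000, burst_bytes=1_500) # lower bandwidth

    def may_send_stun(packet_len: int) -> bool:
        # A STUN message goes out only if both buckets have room for it.
        if burst.has(packet_len) and sustained.has(packet_len):
            burst.take(packet_len)
            sustained.take(packet_len)
            return True
        return False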

Can both sides change ICE ufrag/password at the same time (trigger ICE restart)?

Justin: We run into a glare issue. The text needs to clarify how this gets resolved, either by a tie-breaker or by coordination. This probably needs cooperation between the SDP document and ICEbis. Right now we need to focus on clearing up ambiguities and edge cases; then we can worry about one-sided ICE restarts and other things.

Jonathan: Without using offer/answer, there are ambiguities. Such as sending two ICE restarts and getting a response, and not knowing which restart it's for.

Chairs: ICEbis needs to handle this glare issue, and sending two ICE restarts from one side.

Emil: To match candidates to an ICE restart, we can trickle the ufrag along with the candidate.

Jonathan: That's part of the solution.

Peter (individual): We need to define the behavior in the window between one side changing its ufrag/password and the other side changing its own. And can you trickle candidates from two gathering phases at once? We could go with the current behavior (no checks until both sides do the restart, and candidates come only from the new gathering phase), but it would be valuable to consider optimizations.

Ari: Yes, we need to clarify.

Jonathan: Is it correct that an ICE restart can only occur with offer/answer?

Emil: I think it's a good idea not to require an offer/answer, and trickle the ufrag as suggested earlier.

Jonathan: But there are still cases where you don't know what restart it's associated with. If you trickle two new ufrags and get a new one back, which restart is it for?

Decided to take this discussion offline.

Justin/Ari: Need to deal with this in the SIP ICE document.
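A sketch of Emil's suggestion to trickle the ufrag along with each candidate, so the receiver can tell which restart (gathering generation) a candidate belongs to; the message shape is hypothetical:

    # Hypothetical wire shape: each trickled candidate carries the ufrag of
    # the generation it was gathered under.
    msg = {
        "ufrag": "8hhY",  # example value identifying the gathering generation
        "candidate": "candidate:1 1 UDP 2122252543 192.0.2.1 54321 typ host",
    }

    def accept_candidate(msg, generations):
        # generations: dict mapping a known ufrag to its ICE session state.
        # Candidates from unknown (e.g. already-discarded) generations are ignored.
        gen = generations.get(msg["ufrag"])
        if gen is not None:
            gen.add_remote_candidate(msg["candidate"])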

Passive Aggressive

Peter presented his slides.

This involves clarifying the RFC 5245 text that says "you can send media before a pair is nominated", and deprecating aggressive nomination.

The Chrome implementation is ready: it can send and receive before a pair is nominated (though it still does aggressive nomination to be safe).

Bernard: If you want to deprecate it, you have to do it in ICEbis. So it's not a new work item, it's part of ICEbis.

Ari: This isn't really deprecating, just making it "SHOULD NOT".

Martin: This shouldn't be difficult on the sending side. The tricky part is that some implementations will drop packets received before nomination.

Jonathan: When you say "send media", you mean "send DTLS"?

Peter: Yes, DTLS and media.

Martin: The advantage is that when the offerer is nominating and the answerer is starting DTLS, you can get DTLS going faster, before the answer is received.

Ari: We need text that addresses what happens if you send media/DTLS early.

Martin: Don't think we need to actually deprecate aggressive nomination on the receive side. Can just say that if there are two nominations, the higher-priority pair is used.

Bernard (from Jabber): I think you may need to discuss how to recognize if the receiver can't handle this (e.g. no responses are received to the media sent).

Conclusion: Propose some text to the mailing list and discuss it there.
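One way to picture the receive-side rule Martin suggested, as a sketch (the pair objects are hypothetical):

    def selected_pair(nominated):
        # If aggressive nomination produced more than one nominated pair,
        # use the highest-priority one rather than treating it as an error.
        return max(nominated, key=lambda p: p.priority) if nominated else None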


Future improvements to ICE (15 mins)

The "walk out the door" problem

Peter presented his slides.

Peter: No drafts, just presenting what we've tried so far.

ICE restarts are too slow to solve the "walk out the door" problem. Continual gathering worked well, but there were ambiguities when an ICE restart was required. Backup pairs worked well but consume battery.

Pål Martinsen: Can you actually do this in the OS?

Peter: Android gives the app a notification.

Justin: In later version you can access multiple interfaces, although one is default.

Martin: We've tried backup pairs and it works well. Why is battery an issue? If this is just occasional consent checks it shouldn't make a big difference.

Peter: It means you have to keep the radio on though.

Martin: Good point.

Peter: The "walk in the door" problem isn't that difficult, because you can do an ICE restart while you still have a 3G connection.

Trickle/fast restart idea: each side restarts independently and trickles its ICE ufrag/password. Candidates are paired across generations. Google is going to explore this further.
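A rough sketch of the "candidates paired across generations" part of the idea (entirely hypothetical; this is the part to be explored):

    def pair_across_generations(local_generations, remote_generations):
        # After an independent restart, pair local and remote candidates
        # regardless of which gathering generation each came from, so checks
        # can start before both sides have completed their restarts.
        return [(lc, rc)
                for lgen in local_generations for lc in lgen
                for rgen in remote_generations for rc in rgen]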