ICCRG meeting, IETF 101, London
Friday, Mar 23, 2018, 9:30 AM

Agenda:
(10 min) Paul Congdon - Proposed IEEE 802.1Qcz work (Congestion Isolation)
(15 min) Ingemar Johansson - BBR congestion control with L4S support
(25 min) Neal Cardwell - An Update on BBR Work at Google
(20 min) Praveen Balasubramanian - LEDBAT++
(30 min) Michael Schapira - PCC: Performance-Oriented Congestion Control

Minutes:

a. Congestion Isolation. It was asked whether a ‘congested’ queue is defined for each ‘non-congested’ queue being monitored. Effectively, yes, but it is possible to monitor multiple ‘non-congested’ queues and isolate to a single ‘congested’ queue. The diagrams showing which events generate signaling show only a single upstream switch and downstream switch, but they are simplified for illustration purposes. In reality the network is an L3 CLOS network and incast traffic arrives from multiple ingress ports. It was asked how much of the solution is already known and will be standardized. Multiple solutions may be considered for various aspects of the standard; the project defines the external behavior and scope that can be achieved with implementation flexibility. A recurring challenge in 802.1 hardware standardization is providing enough specification for interoperability and correctness while still allowing implementation flexibility.

b. BBR with L4S support. The changes required to the Linux implementation were small and easy to add. There was some concern about exiting slow start on the first round after experiencing ECN: there is some burstiness at start-up with incast and ‘drive-by’ traffic, and the feeling is that BBR might exit too early. There was further support for the idea and acknowledgement that it works with and uses ECN. How much of the benefit comes from BBR and how much from ECN? There are slides that compare BBR and BBR evo (with ECN and phantom queues).
There was further interest in understanding what causes the improvement: is it the BBR changes or, again, ECN? There was a suggestion to track percentile usage to soften BBR. It was noted that the BBR max filter is too aggressive, and adding ECN softens this. There was a question on how changing some of the parameters would adjust the results; this will be looked at offline. It was requested that the time scales between the gain cycle and the filter receive further analysis. There was a suggestion to look at how this performs with a mix of non-ECN traffic. It was pointed out that when to mark should be based on queue delay, not queue depth; this allows consistent end-to-end configuration. In general there was wide acceptance and approval of this work.

c. BBR at Google. A review of BBR: it is used for TCP and QUIC on google.com and YouTube. Aggregation and BBR were discussed. Batching or consolidation of ACKs sometimes occurs for optimization and efficiency (e.g. WiFi, cable modems, offload mechanisms). The delivery rate estimation draft discusses how to estimate rate in the presence of this aggregation. A WiFi 20 MB transfer was analyzed in detail, showing the impact of large gaps in the ACK stream. There is now an aggregation estimator included in BBR that allows a larger cwnd while the ACK stream is halted; it calculates the expected amount of ACKed data while those ACKs are absent. There was a clarification question about whether the receive window is considered, and the answer was yes. In addition, an adaptive draining algorithm was implemented. It was discovered that it is important to limit the amount of time spent draining the queue and to randomize these phases to avoid synchronization between mice and elephant flows. There is a BBR implementation available on ns-3 and at Stanford. One question was how best to use packet loss as a signal and how to consider the time scale over which loss occurs.
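The aggregation estimate described above can be sketched roughly as follows. This is a simplified illustration, not the Linux BBR code: the function and field names are invented, and it assumes a bandwidth estimate in bytes per second.

```python
def extra_acked(now, ack_bytes, state, bw_est):
    """Track how far the ACK stream runs ahead of the bandwidth estimate.

    When ACKs arrive in aggregated bursts after a gap, the data ACKed in
    the current epoch exceeds bw_est * elapsed_time; that excess measures
    the degree of aggregation, and cwnd can be grown by the running
    maximum of it so sending continues while the ACK stream is halted.
    """
    expected = bw_est * (now - state["epoch_start"])  # bytes expected ACKed by now
    state["acked_in_epoch"] += ack_bytes
    extra = state["acked_in_epoch"] - expected
    if extra < 0:
        # ACKs are behind schedule: no aggregation this epoch, start a new one
        state["epoch_start"] = now
        state["acked_in_epoch"] = 0
        extra = 0
    state["max_extra"] = max(state["max_extra"], extra)
    return state["max_extra"]  # headroom to add on top of the BDP-based cwnd
```

For example, if the bandwidth estimate is 1000 bytes/s and a 500-byte aggregated burst is ACKed 100 ms into an epoch, only 100 bytes were expected, so the estimator reports 400 bytes of aggregation headroom.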
On what time scale should a CC algorithm re-probe for bandwidth after experiencing loss? This is a fundamental question in balancing link utilization and application performance. Key issues with BBR at Google are dealing with aggregation, packet loss signals, and interworking with other CC approaches. It was asked whether tests have been done where the source/destination are on WiFi directly, since this affects the amount of aggregation that occurs. It was noted that ACK aggregation is a critical problem because it is widely deployed, and when it is occurring it is very difficult to measure delay, creating a divided problem space. It was observed that the analysis has been limited to individual flows and not transactional flow interaction. The coding group is looking at how to deal with loss through coding and points out that collaboration is desired and possible. It was asked whether BBR is being used on 5G networks today; the answer is yes. It was asked whether the type of loss is being distinguished and considered (L2 versus other); wireless networks tend to retransmit and create delay variance. It was pointed out that some of the batching in the test results could be caused by power-saving mode in WiFi. Another comment on packet loss is that it may be Poisson (random), but there is also correlated packet loss to be considered; it was agreed that this should be addressed in the long-term research. It was pointed out that latency is the important metric rather than packet loss alone (though the two are clearly related), and perhaps retransmission rates should be increased to reduce latency.

d. Performance-Oriented Congestion Control. This is based on two publications referenced in the slides. The sender tries different rates and analyzes the response to generate a utility value for its performance. These micro-experiments are performed over a time interval to help decide what rate to use in the next interval. This means the congestion control response is not hard-wired as it is in TCP.
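The micro-experiment loop described for PCC can be sketched as follows. This is a hypothetical simplification, not the authors' implementation: it assumes a single throughput-minus-loss utility and a fixed probing step `eps`, whereas the real PCC runs randomized paired experiments and more careful utility functions.

```python
def utility(rate, loss_rate):
    # Illustrative utility: reward delivered throughput, heavily penalize loss.
    # (The PCC papers use more careful forms, e.g. with a sigmoid loss penalty.)
    return rate * (1 - loss_rate) - 10.0 * rate * loss_rate

def pcc_step(rate, send_and_measure, eps=0.05):
    """One PCC decision interval: run micro-experiments at rate*(1 +/- eps),
    observe the resulting loss rate, and move toward the higher-utility rate."""
    u_hi = utility(rate * (1 + eps), send_and_measure(rate * (1 + eps)))
    u_lo = utility(rate * (1 - eps), send_and_measure(rate * (1 - eps)))
    return rate * (1 + eps) if u_hi > u_lo else rate * (1 - eps)
```

Against a simple emulated link (loss grows once the rate exceeds capacity), repeated calls climb toward capacity from below and back off from above, with no hard-wired TCP-style response in the loop.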
The mechanisms in TCP were designed to balance fairness and throughput maximization. PCC needs to consider similar tradeoffs but does so using game theory. The argument is that a TCP sender is a really bad ‘learner’ of the congestion state because its response is hardwired. It is acknowledged that PCC still has issues; suboptimal convergence and poor performance on mobile networks are examples. A version 2 is being worked on with two changes: changes to the utility framework and changes to the online learning algorithm. It is possible for different senders to use different utility functions without impacting convergence. Also, the learning algorithm's utility function can determine the degree of rate adjustment. It was noted that BBR is different but shares many similarities. A clarification on the dynamic environment: it is an emulated environment where one of 5 channel variables (delay, speed, etc.) is randomly varied every 5 seconds. The same dataset was used independently for each of the compared congestion control approaches. It was noted that BBR assumes a stable min-RTT, so comparing it in this dynamic environment is likely not appropriate. There was a question about supporting different utility functions: if applications have completely different objectives (latency vs. throughput), can these different functions really be allowed to coexist in the same network? In the dynamic environment, how did they know the optimal rate? The answer is that it is strictly a raw calculation of what is theoretically possible. It is unclear how multiple flows with different utility functions would impact this. It was requested that a utility function address a less-than-best-effort type of service. There was a question about what happens if the 5-second interval between dynamic network changes is shortened (e.g. to the order of ms or 1 second). This is ongoing research.
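The point about senders using different utility functions, including the request for a less-than-best-effort service, can be illustrated with two sketched utilities in the PCC-Vivace style. The exponent and coefficients loosely follow the published Vivace form, but the scavenger variant and its constant are invented here for illustration:

```python
def throughput_utility(rate, rtt_grad, loss):
    # Vivace-style utility: concave reward for rate, penalties for
    # RTT growth (queue buildup) and for loss.
    return rate ** 0.9 - 900.0 * rate * max(0.0, rtt_grad) - 11.35 * rate * loss

def scavenger_utility(rate, rtt_grad, loss):
    # Hypothetical less-than-best-effort variant: a much heavier penalty
    # on RTT growth makes this sender yield as soon as queueing delay rises.
    return rate ** 0.9 - 9000.0 * rate * max(0.0, rtt_grad) - 11.35 * rate * loss
```

On an idle path (no RTT growth, no loss) the two agree, but as soon as delay starts rising the scavenger's utility drops much faster, so its online learner backs off first; whether such mixes converge in a shared network is exactly the open question raised in the discussion.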
The assertion that PCC doesn’t work as well on mobile networks suggests that the 5-second interval used in the dynamic-environment analysis might not be realistic. A separate question concerned the implementation cost of this approach and whether it has been compared with BBR. It was asked whether any application-level metrics (QoE) were considered in the evaluation; the answer is that they are starting to do this and agree it is important. It was also asked whether other parameters, such as tail latency in data centers, have been evaluated; they have not, but the high-level concept should be applicable. Spencer asked a question in the form of a suggestion: address how this works with QUIC, and identify whether there are specific use-cases (corners of the Internet) where this might be able to progress more quickly. There was a question about whether the 17x improvement on satellite networks is accurate; the discussion will be taken offline. The loss rate is determined by measuring SACKs. It was pointed out that L2 retransmissions can have an impact on these models. The researchers are very interested in both mobile and satellite networks.

e. A new CC in bandwidth-guaranteed networks. No time for this presentation.