AI/ML for Congestion Control & Congestion Control for AI/ML
CC for ML
ML-aware Congestion Control to Mitigate Stragglers in ML Training
ML for CC (CLCC, CLINE)
A Learning-based Congestion Control that Instantly Adapts to Unseen Network Environments
Summary
How to build an ideal congestion control that immediately adapts to time-varying network conditions and persistently achieves high throughput, low latency, and fairness has been one of the most important questions in networking. Although learning-based controls such as DRL (deep reinforcement learning) have been tried, conventional machine learning methods have a fundamental limitation: they cannot infer optimal control actions for an arbitrary environment on which they have not been trained. Thus, in a time-varying network, a sender needs to keep learning in unseen environments, which raises the following challenges in collecting proper sample data to learn the optimal series of actions at run time: 1) Because a sender can observe only limited information, it is hard for the sender to infer its fair share in the current network environment. 2) By the nature of the distributed congestion control problem, multiple senders that simultaneously learn to find optimal control actions cannot converge, because the coexisting senders keep changing their behavior. 3) The exploration process of DRL in unseen environments requires a sender to take a huge number of random actions, which seriously degrades performance. To address these challenges, we propose a new learning-based congestion control, CLINE, built on the following techniques. First, CLINE classifies the current network state in much finer detail by paying attention to packet arrival patterns, and leverages this information to identify the current network environment. Second, in unseen environments, CLINE leverages conventional TCP algorithms that are guaranteed to make each sender converge to its fair rate. Finally, CLINE outputs a best-projected action in unseen environments by exploiting a “predictive world model”.
Through extensive experiments under various network scenarios, we confirm that CLINE sustains high performance even in unseen environments: it switches almost instantly to a rate close to the new desired sending rate, whereas other DRL-based congestion controls spend a substantial amount of time on exploration and adaptation.
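To make the control flow in the summary more concrete, the following is a minimal sketch of how a sender might choose between a previously learned rate and a conventional TCP-style fallback. It is not CLINE's implementation: the `arrival_signature` quantization, the `Sender` class, the 5 ms bucket, and the AIMD constants are all hypothetical, and the predictive world model is omitted entirely.

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev


def arrival_signature(inter_arrivals_ms, bucket_ms=5.0):
    """Quantize packet inter-arrival statistics into a coarse environment key.

    Hypothetical stand-in for a real network-state classifier.
    """
    avg = mean(inter_arrivals_ms)
    jitter = pstdev(inter_arrivals_ms)
    return (round(avg / bucket_ms), round(jitter / bucket_ms))


@dataclass
class Sender:
    rate_mbps: float = 1.0
    # Maps an environment signature to the rate remembered for it (assumed structure).
    known_rates: dict = field(default_factory=dict)

    def step(self, inter_arrivals_ms, loss_detected):
        """Pick the next sending rate and report which path was taken."""
        key = arrival_signature(inter_arrivals_ms)
        if key in self.known_rates:
            # Seen environment: jump straight to the remembered rate.
            self.rate_mbps = self.known_rates[key]
            return "learned"
        # Unseen environment: fall back to AIMD, as conventional TCP does.
        if loss_detected:
            self.rate_mbps = max(self.rate_mbps / 2.0, 0.1)  # multiplicative decrease
        else:
            self.rate_mbps += 1.0  # additive increase
        return "fallback"

    def remember(self, inter_arrivals_ms):
        """Store the current rate for the environment this pattern maps to."""
        self.known_rates[arrival_signature(inter_arrivals_ms)] = self.rate_mbps


if __name__ == "__main__":
    sender = Sender(rate_mbps=10.0)
    pattern = [10.0, 10.0, 10.0]
    print(sender.step(pattern, loss_detected=False), sender.rate_mbps)
    sender.remember(pattern)
    print(sender.step(pattern, loss_detected=False), sender.rate_mbps)
```

The fallback path uses additive-increase/multiplicative-decrease because that is the classic TCP behavior known to drive competing senders toward a fair allocation; a real system would replace the lookup table with a learned model.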
CLINE: Design Overview
Performance
The figure below shows throughput and RTT in unseen environments where multiple flows share a bottleneck link whose bandwidth changes dynamically. As shown, Copa and Cubic have difficulty adapting immediately to changes in the number of competing flows and in the bandwidth. Because multiple flows compete, Orca also fails to find the optimal fair share of the network and keeps oscillating. In contrast, thanks to its ability to output a best-projected sending rate, CLINE keeps inferring sending rates close to the optimal fair share even in unseen environments with multiple flows. We also validate that CLINE keeps improving its best-projected action as it repeatedly revisits an initially unseen environment. As shown in the figure, CLINE outputs a projected sending rate even better than its previous best-projected sending rate, resulting in much less oscillation before convergence.
Throughput and RTT of CLINE, Orca, Copa and Cubic in unseen environments.
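As a reference point for the "optimal fair share" mentioned above, the sketch below computes the equal share of a bottleneck link and Jain's fairness index for a set of flow rates. The function names and the example numbers are illustrative assumptions, not values taken from the experiments.

```python
def fair_share(bottleneck_mbps, num_flows):
    """Equal split of the bottleneck bandwidth among competing flows."""
    if num_flows <= 0:
        raise ValueError("num_flows must be positive")
    return bottleneck_mbps / num_flows


def jain_index(rates):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2), in the range (0, 1]."""
    squares = sum(r * r for r in rates)
    if not rates or squares == 0:
        raise ValueError("need at least one non-zero rate")
    total = sum(rates)
    return (total * total) / (len(rates) * squares)


if __name__ == "__main__":
    # Illustrative numbers only: four flows on a 100 Mbps bottleneck.
    print(fair_share(100.0, 4))
    print(jain_index([25.0, 25.0, 25.0, 25.0]))
    print(jain_index([100.0, 0.0, 0.0, 0.0]))
```

An index of 1 means every flow receives the same rate, while an index of 1/n means a single flow takes the whole link, so the two functions together give a simple way to judge how close a measured allocation is to the fair share.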
Publications (Ongoing)
Members