Abstract: State-of-the-art deep learning models rely on large GPU clusters and various parallelism strategies, which in turn depend on collective communication (CC) operators to synchronize data.