In this paper, we present an innovative method called Gradient Agreement Filtering (GAF) to enhance distributed deep learning optimization by addressing the issue of orthogonal or negatively correlated gradients across microbatches. By computing the cosine distance between micro-gradients during training and filtering out conflicting updates before averaging, we reduce gradient variance and improve validation accuracy even with smaller microbatch sizes. The technique also helps mitigate memorization of noisy labels, and it shows significant gains on the CIFAR-100 and CIFAR-100N-Fine image classification benchmarks. Our approach consistently improves validation accuracy over traditional gradient averaging, by up to 18.2% in some cases, while greatly reducing computational requirements.
https://arxiv.org/abs/2412.18052
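
The filtering step described above can be sketched in a few lines. The snippet below is a minimal, hypothetical PyTorch illustration, not the authors' reference implementation: micro-gradients are folded into a running average only while their cosine distance to the current aggregate stays below a threshold. The helper names `cosine_distance` and `gaf_combine`, and the default threshold `tau`, are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def cosine_distance(g1: torch.Tensor, g2: torch.Tensor) -> torch.Tensor:
    """Cosine distance (1 - cosine similarity) between two flattened gradients."""
    return 1.0 - F.cosine_similarity(g1, g2, dim=0)

def gaf_combine(micro_grads, tau: float = 0.97) -> torch.Tensor:
    """Fold micro-gradients into a running average, skipping any whose
    cosine distance to the current aggregate exceeds the threshold tau
    (assumed filtering rule for this sketch)."""
    agg = micro_grads[0].clone()
    kept = 1
    for g in micro_grads[1:]:
        if cosine_distance(agg, g) < tau:
            # Agreeing gradient: fold it into the running average.
            agg = (agg * kept + g) / (kept + 1)
            kept += 1
        # Otherwise the conflicting gradient is filtered out entirely.
    return agg

# Toy usage: each entry stands in for one flattened micro-batch gradient,
# e.g. torch.cat([p.grad.flatten() for p in model.parameters()]).
micro_grads = [torch.randn(10) for _ in range(4)]
combined = gaf_combine(micro_grads, tau=0.97)
```

Orthogonal gradients have a cosine distance of 1, so a threshold just below 1 discards updates that disagree with the running aggregate while keeping those that broadly point the same way; see the paper for the exact procedure and threshold values used.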