Sign Bit is Enough: A Learning Synchronization Framework for Multi-hop All-reduce with Ultimate Compression
Time
Tuesday, July 12th, 1:53pm - 2:15pm PDT
Location
3000, Level 3
Event Type
Research Manuscript
ML Algorithms and Applications
Description
When traditional sign SGD algorithms realize one-bit transmission under multi-hop all-reduce (MAR), cascading compression occurs and leads to divergence. MAR has been widely adopted in network-intensive high-performance computing (HPC) systems such as public clouds. In this paper, we implement Marsit, a one-bit compression-based learning synchronization framework that prevents cascading compression and ensures that the compression operation can be done in parallel. Theoretically, the proposed framework retains the same convergence rate as non-compression mechanisms. Experimentally, our approach preserves the same accuracy as methods without compression while spending up to 35% less time in training.
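The failure mode named in the description can be illustrated with a small sketch. It is not the Marsit framework; it only assumes that "cascading compression" means a partial aggregate is compressed back to one bit per coordinate at every hop. The function names, the chain-shaped hop model, and the toy gradients are all hypothetical and chosen for illustration.

```python
def sign_compress(vec):
    """One-bit compression: keep only the sign of each coordinate (+1 or -1)."""
    return [1 if x >= 0 else -1 for x in vec]


def chain_reduce_cascaded(worker_grads):
    """Pass a partial aggregate along a chain of workers (assumed hop model),
    compressing again at every hop so each message stays one bit per coordinate."""
    acc = sign_compress(worker_grads[0])
    for grad in worker_grads[1:]:
        # The receiver adds its full-precision local gradient to the one-bit
        # message it received, then compresses the result before forwarding.
        acc = sign_compress([a + g for a, g in zip(acc, grad)])
    return acc


def sign_of_sum(worker_grads):
    """Reference: sum the full-precision gradients first, compress once."""
    totals = [sum(col) for col in zip(*worker_grads)]
    return sign_compress(totals)


# Toy gradients for four workers, two coordinates each (made-up values).
grads = [
    [5.0, 0.2],
    [-1.5, 0.3],
    [-1.5, 0.1],
    [-1.5, 0.4],
]

print(sign_of_sum(grads))            # [1, 1]
print(chain_reduce_cascaded(grads))  # [-1, 1]
```

In the first coordinate the true sum is 0.5, so its sign is positive, but the first worker's large gradient is flattened to +1 before the remaining workers add theirs, and the cascaded result ends up negative. Repeated over many steps, sign errors of this kind are one plausible route to the divergence the description mentions.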