TY - GEN
T1 - Sign Bit is Enough: A Learning Synchronization Framework for Multi-hop All-reduce with Ultimate Compression
AU - Wu, Feijie
AU - He, Shiqi
AU - Guo, Song
AU - Qu, Zhihao
AU - Wang, Haozhao
AU - Zhuang, Weihua
AU - Zhang, Jie
N1 - Funding Information:
This research was supported by the funding from the Key-Area Research and Development Program of Guangdong Province (No. 2021B0101400003), Hong Kong RGC Research Impact Fund (RIF) with the Project No. R5060-19, General Research Fund (GRF) with the Project No. 152221/19E, 152203/20E, and 152244/21E, the National Natural Science Foundation of China (61872310, 62102131), Shenzhen Science and Technology Innovation Commission (R2020A045), and Natural Science Foundation of Jiangsu Province (BK20210361).
Publisher Copyright:
© 2022 ACM.
PY - 2022/7/10
Y1 - 2022/7/10
N2 - Traditional one-bit compressed stochastic gradient descent cannot be directly employed in multi-hop all-reduce, a widely adopted distributed training paradigm in network-intensive high-performance computing systems such as public clouds. Our theoretical findings show that, due to cascading compression, the training process suffers considerable deterioration in convergence performance. To overcome this limitation, we implement Marsit, a sign-bit compression-based learning synchronization framework. It prevents cascading compression via an elaborate bit-wise operation for unbiased sign aggregation and a dedicated global compensation mechanism that mitigates compression deviation. The proposed framework retains the same theoretical convergence rate as non-compression mechanisms. Experimental results demonstrate that Marsit reduces training time by up to 35% while preserving the same accuracy as training without compression.
AB - Traditional one-bit compressed stochastic gradient descent cannot be directly employed in multi-hop all-reduce, a widely adopted distributed training paradigm in network-intensive high-performance computing systems such as public clouds. Our theoretical findings show that, due to cascading compression, the training process suffers considerable deterioration in convergence performance. To overcome this limitation, we implement Marsit, a sign-bit compression-based learning synchronization framework. It prevents cascading compression via an elaborate bit-wise operation for unbiased sign aggregation and a dedicated global compensation mechanism that mitigates compression deviation. The proposed framework retains the same theoretical convergence rate as non-compression mechanisms. Experimental results demonstrate that Marsit reduces training time by up to 35% while preserving the same accuracy as training without compression.
KW - distributed machine learning
KW - multi-hop all-reduce
KW - signSGD
UR - https://www.scopus.com/pages/publications/85137533530
U2 - 10.1145/3489517.3530417
DO - 10.1145/3489517.3530417
M3 - Conference article published in proceeding or book
AN - SCOPUS:85137533530
T3 - Proceedings - Design Automation Conference
SP - 193
EP - 198
BT - Proceedings of the 59th ACM/IEEE Design Automation Conference, DAC 2022
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 59th ACM/IEEE Design Automation Conference, DAC 2022
Y2 - 10 July 2022 through 14 July 2022
ER -