TY - JOUR
T1 - Error-Compensated Sparsification for Communication-Efficient Decentralized Training in Edge Environment
AU - Wang, Haozhao
AU - Guo, Song
AU - Qu, Zhihao
AU - Li, Ruixuan
AU - Liu, Ziming
N1 - Funding Information:
This work was supported in part by the National Key Research and Development Program of China under Grant 2016QY01W0202, in part by the Hong Kong RGC Research Impact Fund (RIF) through the Projects R5060-19 and R5034-18, in part by General Research Fund (GRF) through the Projects 152221/19E and 15220320/20E, in part by the Collaborative Research Fund (CRF) through the Project C5026-18G, in part by the National Natural Science Foundation of China under Grants 61872310, U1836204, and U1936108, in part by the Shenzhen Science and Technology Innovation Commission under Grant R2020A045, in part by the Shenzhen Basic Research Funding Scheme under Grant JCYJ20170818103849343, and in part by the Fundamental Research Funds for the Central Universities under Grant B210202079.
Publisher Copyright:
© 1990-2012 IEEE.
PY - 2022/1/1
Y1 - 2022/1/1
N2 - Communication has been considered a major bottleneck in large-scale decentralized training systems, since participating nodes iteratively exchange large amounts of intermediate data with their neighbors. Although compression techniques such as sparsification can significantly reduce the communication overhead in each iteration, the errors caused by compression accumulate, severely degrading the convergence rate. Recently, error compensation for sparsification has been proposed in centralized training to tolerate the accumulated compression errors. However, the analogous technique for decentralized training, together with the corresponding convergence theory, remains unknown. To fill this gap, we design ECSD-SGD, a method that significantly accelerates decentralized training via error-compensated sparsification. The novelty lies in identifying the key component of the information exchanged in each iteration (i.e., the sparsified model update) and applying targeted error compensation to that component. Our thorough theoretical analysis shows that ECSD-SGD supports arbitrary sparsification ratios and achieves the same convergence rate as non-sparsified decentralized training methods. We also conduct extensive experiments on multiple deep learning models to validate our theoretical findings. The results show that ECSD-SGD outperforms all state-of-the-art sparsification methods in terms of both convergence speed and final generalization accuracy.
AB - Communication has been considered a major bottleneck in large-scale decentralized training systems, since participating nodes iteratively exchange large amounts of intermediate data with their neighbors. Although compression techniques such as sparsification can significantly reduce the communication overhead in each iteration, the errors caused by compression accumulate, severely degrading the convergence rate. Recently, error compensation for sparsification has been proposed in centralized training to tolerate the accumulated compression errors. However, the analogous technique for decentralized training, together with the corresponding convergence theory, remains unknown. To fill this gap, we design ECSD-SGD, a method that significantly accelerates decentralized training via error-compensated sparsification. The novelty lies in identifying the key component of the information exchanged in each iteration (i.e., the sparsified model update) and applying targeted error compensation to that component. Our thorough theoretical analysis shows that ECSD-SGD supports arbitrary sparsification ratios and achieves the same convergence rate as non-sparsified decentralized training methods. We also conduct extensive experiments on multiple deep learning models to validate our theoretical findings. The results show that ECSD-SGD outperforms all state-of-the-art sparsification methods in terms of both convergence speed and final generalization accuracy.
KW - communication compression
KW - decentralized training
KW - distributed machine learning
KW - error compensation
UR - http://www.scopus.com/inward/record.url?scp=85107227588&partnerID=8YFLogxK
U2 - 10.1109/TPDS.2021.3084104
DO - 10.1109/TPDS.2021.3084104
M3 - Journal article
AN - SCOPUS:85107227588
SN - 1045-9219
VL - 33
SP - 14
EP - 25
JO - IEEE Transactions on Parallel and Distributed Systems
JF - IEEE Transactions on Parallel and Distributed Systems
IS - 1
M1 - 9442310
ER -