TY - JOUR
T1 - LOSP: Overlap Synchronization Parallel with Local Compensation for Fast Distributed Training
AU - Wang, Haozhao
AU - Qu, Zhihao
AU - Guo, Song
AU - Wang, Ningqi
AU - Li, Ruixuan
AU - Zhuang, Weihua
N1 - Funding Information:
Manuscript received July 3, 2020; revised March 10, 2021; accepted April 21, 2021. Date of publication June 7, 2021; date of current version July 16, 2021. This work was supported in part by the National Key Research and Development Program of China under Grant 2016YFB0800402; in part by the Hong Kong RGC Research Impact Fund (RIF) under Project R5060-19 and Project R5034-18; in part by the General Research Fund (GRF) under Project 152221/19E and Project 15220320/20E; in part by the Collaborative Research Fund (CRF) under Project C5026-18G; in part by the National Natural Science Foundation of China under Grant 61872310, Grant U1836204, and Grant U1936108; in part by the Shenzhen Science and Technology Innovation Commission under Grant R2020A045; in part by the Shenzhen Basic Research Funding Scheme under Grant JCYJ20170818103849343; and in part by the China Postdoctoral Science Foundation under Grant 2019M661709. This is an extended revision of the earlier version that appeared in [1]. (Corresponding authors: Song Guo; Ruixuan Li.) Haozhao Wang is with the School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China, and also with the Department of Computing, The Hong Kong Polytechnic University, Hong Kong (e-mail: [email protected]).
Publisher Copyright:
© 2021 IEEE.
PY - 2021/8
Y1 - 2021/8
N2 - When running under the Parameter Server (PS) architecture, Distributed Stochastic Gradient Descent (D-SGD) incurs significant communication delays and heavy communication overhead due to model synchronization. Moreover, because workers have heterogeneous computational capabilities, traditional synchronization modes under-utilize computational resources: fast workers must wait for slow ones to finish their computation. Although our previous work OSP effectively alleviates these problems by overlapping the computation and communication procedures and allowing an adaptive number of local updates in distributed training, the overlap introduces staleness, which degrades performance. In this paper, we propose a new method named LOSP, which introduces local compensation into our previous synchronization mechanism and thereby mitigates the adverse effects of overlapping synchronization. We theoretically prove that LOSP (1) preserves the same convergence rate as sequential SGD for non-convex problems, and (2) exhibits good scalability thanks to a linear speedup with respect to both the number of workers and the average number of local updates. Evaluations show that LOSP significantly outperforms state-of-the-art methods in terms of both convergence accuracy and communication cost.
KW - distributed machine learning
KW - local compensation
KW - Overlap synchronization parallel
KW - parameter server
UR - http://www.scopus.com/inward/record.url?scp=85110639581&partnerID=8YFLogxK
U2 - 10.1109/JSAC.2021.3087272
DO - 10.1109/JSAC.2021.3087272
M3 - Journal article
AN - SCOPUS:85110639581
SN - 0733-8716
VL - 39
SP - 2541
EP - 2557
JO - IEEE Journal on Selected Areas in Communications
JF - IEEE Journal on Selected Areas in Communications
IS - 8
M1 - 9448017
ER -