TY - JOUR
T1 - Falcon: Addressing Stragglers in Heterogeneous Parameter Server via Multiple Parallelism
AU - Zhou, Qihua
AU - Guo, Song
AU - Lu, Haodong
AU - Li, Li
AU - Guo, Minyi
AU - Sun, Yanfei
AU - Wang, Kun
N1 - Funding Information:
This work was financially supported in part by the National Key Research and Development Program of China under Grant 2018YFB1003500; in part by the General Research Fund of the Research Grants Council of Hong Kong (PolyU 152221/19E); in part by the National Natural Science Foundation of China under Grants 61872310, 61772286, 61872195, 61872240, 61832006, 61572262, and 61802208; in part by the Jiangsu Key Research and Development Program (BE2019742), China; and in part by the Natural Science Foundation of Jiangsu Province (BK20191381), China. The preliminary version of this article, titled “Falcon: Towards Computation-Parallel Deep Learning in Heterogeneous Parameter Server” [49], was published in IEEE ICDCS 2019.
Publisher Copyright:
© 1968-2012 IEEE.
PY - 2021/1/1
Y1 - 2021/1/1
N2 - The parameter server architecture has shown promising performance advantages when handling deep learning (DL) applications. One crucial issue in this regard is the presence of stragglers, which significantly retards DL training progress. Previous solutions for handling stragglers may not fully exploit the computation resources of the cluster, as evidenced by our experiments, especially in heterogeneous environments. This motivates us to design a heterogeneity-aware parameter server paradigm that addresses stragglers and accelerates DL training from the perspective of computation parallelism. We introduce a novel methodology named straggler projection to give a comprehensive inspection of stragglers and reveal practical guidelines for solving this problem in two aspects: (1) controlling each worker's training speed via elastic training parallelism control and (2) transferring blocked tasks from stragglers to pioneers to fully utilize the computation resources. Following these guidelines, we propose the abstraction of parallelism as an infrastructure and design the Elastic-Parallelism Synchronous Parallel (EPSP) algorithm to handle distributed training and parameter synchronization, supporting both enforced- and slack-synchronization schemes. The whole idea has been implemented in a prototype called Falcon, which effectively accelerates DL training in the presence of stragglers. Evaluation under various benchmarks with baseline comparisons demonstrates the superiority of our system. Specifically, Falcon reduces the training convergence time by up to 61.83, 55.19, 38.92, and 23.68 percent compared with FlexRR, Sync-opt, ConSGD, and DynSGD, respectively.
AB - The parameter server architecture has shown promising performance advantages when handling deep learning (DL) applications. One crucial issue in this regard is the presence of stragglers, which significantly retards DL training progress. Previous solutions for handling stragglers may not fully exploit the computation resources of the cluster, as evidenced by our experiments, especially in heterogeneous environments. This motivates us to design a heterogeneity-aware parameter server paradigm that addresses stragglers and accelerates DL training from the perspective of computation parallelism. We introduce a novel methodology named straggler projection to give a comprehensive inspection of stragglers and reveal practical guidelines for solving this problem in two aspects: (1) controlling each worker's training speed via elastic training parallelism control and (2) transferring blocked tasks from stragglers to pioneers to fully utilize the computation resources. Following these guidelines, we propose the abstraction of parallelism as an infrastructure and design the Elastic-Parallelism Synchronous Parallel (EPSP) algorithm to handle distributed training and parameter synchronization, supporting both enforced- and slack-synchronization schemes. The whole idea has been implemented in a prototype called Falcon, which effectively accelerates DL training in the presence of stragglers. Evaluation under various benchmarks with baseline comparisons demonstrates the superiority of our system. Specifically, Falcon reduces the training convergence time by up to 61.83, 55.19, 38.92, and 23.68 percent compared with FlexRR, Sync-opt, ConSGD, and DynSGD, respectively.
KW - Distributed Deep Learning
KW - Heterogeneous Environment
KW - Parameter Server
KW - Straggler
UR - http://www.scopus.com/inward/record.url?scp=85079669289&partnerID=8YFLogxK
U2 - 10.1109/TC.2020.2974461
DO - 10.1109/TC.2020.2974461
M3 - Journal article
AN - SCOPUS:85079669289
SN - 0018-9340
VL - 70
SP - 139
EP - 155
JO - IEEE Transactions on Computers
JF - IEEE Transactions on Computers
IS - 1
M1 - 9000921
ER -