Abstract
The parameter server architecture has shown clear performance advantages for deep learning (DL) applications. One crucial issue in this regard is the presence of stragglers, which significantly retards DL training progress. As evidenced by our experiments, previous solutions may not fully exploit the computation resources of a machine, especially in heterogeneous environments. This motivates us to build a heterogeneity-aware parameter server system that addresses stragglers and accelerates DL training from the perspective of computation parallelism. We introduce straggler projection to comprehensively characterize stragglers, and we tackle the problem in two ways: (1) controlling each worker's training speed via elastic parallelism control and (2) transferring blocked tasks from stragglers to pioneers to fully exploit computation resources. We propose the abstraction of parallelism as an infrastructure and design the Elastic-Parallelism Synchronous Parallel (EPSP) algorithm to handle distributed training and parameter synchronization, supporting both enforced- and slack-synchronization schemes. We implement these ideas in a prototype called Falcon, which accelerates DL training in the presence of stragglers. Evaluation on various benchmarks demonstrates the superiority of our system. Specifically, Falcon reduces training convergence time by up to 61.83%, 55.19%, 38.92%, and 2.68% compared with FlexRR, Sync-opt, ConSGD, and DynSGD, respectively.
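To make the abstract's two mechanisms concrete, the sketch below simulates slack-bounded synchronization combined with elastic per-worker parallelism in Python. It is a minimal illustration of the idea as described above, not the paper's actual EPSP implementation; all names (`Worker`, `adjust_parallelism`, `epsp_step`, `SLACK`, the speed thresholds) are hypothetical.

```python
# Illustrative sketch (assumed, not Falcon's real API): slow workers shed
# intra-iteration tasks to fast workers (pioneers), and a slack bound limits
# how far any worker may run ahead of the slowest one.
from dataclasses import dataclass

SLACK = 2              # max iteration gap; SLACK = 0 reduces to enforced (bulk) sync
BASE_PARALLELISM = 4   # initial intra-worker task parallelism

@dataclass
class Worker:
    wid: int
    speed: float             # relative compute speed (1.0 = baseline)
    parallelism: int = BASE_PARALLELISM
    iteration: int = 0
    backlog: int = 0          # blocked tasks waiting to be offloaded

def adjust_parallelism(workers):
    """Elastic parallelism control: stragglers shrink their task share,
    pioneers absorb the shed (blocked) tasks."""
    mean_speed = sum(w.speed for w in workers) / len(workers)
    for w in workers:
        if w.speed < 0.8 * mean_speed:       # straggler: halve its parallelism
            shed = max(1, w.parallelism // 2)
            w.parallelism -= shed
            w.backlog += shed
    pioneers = [w for w in workers if w.speed > 1.2 * mean_speed]
    for w in workers:
        while w.backlog and pioneers:        # hand blocked tasks to a pioneer
            pioneers[0].parallelism += 1
            w.backlog -= 1

def epsp_step(workers):
    """Slack-synchronization: a worker advances (pushes gradients, pulls
    parameters) only if it is within SLACK iterations of the slowest worker."""
    slowest = min(w.iteration for w in workers)
    for w in workers:
        if w.iteration - slowest <= SLACK:
            w.iteration += 1

if __name__ == "__main__":
    cluster = [Worker(0, 1.0), Worker(1, 0.5), Worker(2, 1.5)]
    for _ in range(5):
        adjust_parallelism(cluster)
        epsp_step(cluster)
    for w in cluster:
        print(w.wid, w.iteration, w.parallelism)
```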
| Original language | English |
| --- | --- |
| Journal | IEEE Transactions on Computers |
| DOIs | |
| Publication status | Accepted/In press - 2020 |
Keywords
- Distributed Deep Learning
- Heterogeneous Environment
- Parameter Server
- Straggler
ASJC Scopus subject areas
- Software
- Theoretical Computer Science
- Hardware and Architecture
- Computational Theory and Mathematics