Falcon: Addressing Stragglers in Heterogeneous Parameter Server via Multiple Parallelism

Qihua Zhou, Song Guo, Haodong Lu, Li Li, Minyi Guo, Yanfei Sun, Kun Wang

Research output: Journal article publication › Journal article › Academic research › peer-review

Abstract

The parameter server architecture has shown strong performance for deep learning (DL) applications. One crucial issue in this setting is the presence of stragglers, which significantly slow DL training progress. As our experiments show, previous solutions may not fully exploit the computation resources of a machine, especially in a heterogeneous environment. This motivates us to build a heterogeneity-aware parameter server system that addresses stragglers and accelerates DL training from the perspective of computation parallelism. We introduce straggler projection to give a comprehensive inspection of stragglers, and we solve the problem in two ways: (1) controlling each worker's training speed via elastic parallelism control and (2) transferring blocked tasks from stragglers to pioneers to fully exploit computation resources. We propose the abstraction of parallelism as an infrastructure and design the Elastic-Parallelism Synchronous Parallel (EPSP) algorithm to handle distributed training and parameter synchronization, supporting both enforced- and slack-synchronization schemes. The whole idea has been implemented in a prototype called Falcon, which accelerates DL training in the presence of stragglers. Evaluation under various benchmarks demonstrates the superiority of our system. Specifically, Falcon reduces training convergence time by up to 61.83%, 55.19%, 38.92% and 2.:68% compared with FlexRR, Sync-opt, ConSGD and DynSGD, respectively.
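To make the second idea in the abstract concrete (moving work from stragglers to pioneers), the following is a minimal toy sketch in Python. It is not Falcon's EPSP algorithm, whose details the abstract does not give. The function names, the "tasks per second" speed model, and the greedy one-task-at-a-time rule are all assumptions made purely for illustration.

```python
def finish_times(tasks, speeds):
    """Time each worker needs to finish its assigned tasks (tasks / speed)."""
    return [t / s for t, s in zip(tasks, speeds)]


def rebalance(tasks, speeds):
    """Greedily move one task at a time from the straggler to the pioneer.

    Toy heuristic (an assumption, not the paper's EPSP algorithm): the
    straggler is the worker that finishes last, the pioneer the one that
    finishes first. Stop as soon as giving the pioneer one more task would
    make it finish no earlier than the current straggler.
    """
    tasks = list(tasks)  # work on a copy; leave the caller's list untouched
    while True:
        times = finish_times(tasks, speeds)
        straggler = max(range(len(tasks)), key=times.__getitem__)
        pioneer = min(range(len(tasks)), key=times.__getitem__)
        if straggler == pioneer or tasks[straggler] == 0:
            return tasks
        # Finish time of the pioneer if it accepted one more task.
        new_pioneer_time = (tasks[pioneer] + 1) / speeds[pioneer]
        if new_pioneer_time >= times[straggler]:
            return tasks
        tasks[straggler] -= 1
        tasks[pioneer] += 1


# Three hypothetical workers with different speeds (tasks per second),
# each starting with the same number of tasks.
speeds = [4.0, 2.0, 1.0]
initial = [8, 8, 8]
balanced = rebalance(initial, speeds)
print(max(finish_times(initial, speeds)), max(finish_times(balanced, speeds)))
```

In this toy setting the iteration time is set by the slowest worker, so shifting tasks toward faster workers shortens each synchronous step without changing the total amount of work.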

Original language: English
Journal: IEEE Transactions on Computers
Publication status: Accepted/In press - 2020

Keywords

  • Distributed Deep Learning
  • Heterogeneous Environment
  • Parameter Server
  • Straggler

ASJC Scopus subject areas

  • Software
  • Theoretical Computer Science
  • Hardware and Architecture
  • Computational Theory and Mathematics
