Falcon: Towards Computation-Parallel Deep Learning in Heterogeneous Parameter Server

Qihua Zhou, Kun Wang, Song Guo, Haodong Lu, Li Li, Minyi Guo, Yanfei Sun

Research output: Unpublished conference presentation (presented paper, abstract, poster)Conference presentation (not published in journal/proceeding/book)Academic researchpeer-review

10 Citations (Scopus)

Abstract

Parameter server paradigm has shown great performance superiority for handling deep learning (DL) applications. One crucial issue in this regard is the presence of stragglers, which significantly retards DL training progress. Previous solutions for solving straggler may not fully exploit the computation capacity of a cluster as evidenced by our experiments. This motivates us to make an attempt at building a new parameter server architecture that mitigates and addresses stragglers in heterogeneous DL from the perspective of computation parallelism. We introduce a novel methodology named straggler projection to give a comprehensive inspection of stragglers and reveal practical guidelines for resolving this problem: (1) reducing straggler emergence frequency via elastic parallelism control and (2) transferring blocked tasks to pioneer workers for fully exploiting cluster computation capacity. Following the guidelines, we propose the abstraction of parallelism as an infrastructure and elaborate the Elastic-Parallelism Synchronous Parallel (EPSP) that supports both enforced-and slack-synchronization schemes. The whole idea has been implemented in a prototype called Falcon which efficiently accelerates the DL training progress with the presence of stragglers. Evaluation under various benchmarks with baseline comparison evidences the superiority of our system. Specifically, Falcon yields shorter convergence time, by up to 61.83%, 55.19%, 38.92% and 23.68% reduction over FlexRR, Sync-opt, ConSGD and DynSGD, respectively.

Original languageEnglish
Pages196-206
Number of pages11
DOIs
Publication statusPublished - Jul 2019
Event39th IEEE International Conference on Distributed Computing Systems, ICDCS 2019 - Richardson, United States
Duration: 7 Jul 20199 Jul 2019

Conference

Conference39th IEEE International Conference on Distributed Computing Systems, ICDCS 2019
CountryUnited States
CityRichardson
Period7/07/199/07/19

Keywords

  • Distributed Deep Learning
  • Heterogeneous Environment
  • Parameter Server
  • Straggler

ASJC Scopus subject areas

  • Software
  • Hardware and Architecture
  • Computer Networks and Communications

Cite this