OSP: Overlapping computation and communication in parameter server for fast machine learning

Haozhao Wang, Song Guo, Ruixuan Li

Research output: Chapter in book / Conference proceeding › Conference article published in proceeding or book › Academic research › peer-review

12 Citations (Scopus)

Abstract

When running in a Parameter Server (PS) architecture, distributed Stochastic Gradient Descent (SGD) incurs significant communication delays because, after pushing their updates, computing nodes (workers) have to wait for the global model to be communicated back from the master in every iteration. In this paper, we devise a new synchronization parallel mechanism named overlap synchronization parallel (OSP), in which this waiting time is removed by conducting computation and communication in an overlapped manner. We theoretically prove that our mechanism achieves the same convergence rate as sequential SGD for non-convex problems. Evaluations show that our mechanism significantly improves performance over state-of-the-art mechanisms, e.g., by 4× in convergence speed for both AlexNet and ResNet18.
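
The abstract describes the mechanism only at a high level. The following is a minimal, single-worker Python sketch of what overlapping the push/pull exchange with the next gradient computation can look like in a PS setting; it is not the authors' implementation. The ParameterServer class, push_pull, compute_gradient, and worker_loop names, the toy least-squares objective, and the thread-based overlap are all illustrative assumptions.

# Minimal sketch (not the authors' code): the worker starts the push/pull
# exchange with the parameter server in a background thread and immediately
# computes the next gradient on the model copy it already holds, instead of
# blocking until the fresh global model arrives.

import threading
import time
from concurrent.futures import ThreadPoolExecutor

import numpy as np


class ParameterServer:
    """Toy in-process stand-in for the master holding the global model."""

    def __init__(self, dim):
        self.model = np.zeros(dim)
        self._lock = threading.Lock()

    def push_pull(self, grad, lr=0.1):
        """Apply a worker's update, then return the current global model."""
        with self._lock:
            time.sleep(0.05)              # simulated network/aggregation delay
            self.model -= lr * grad
            return self.model.copy()


def compute_gradient(model, x, y):
    """Gradient of the least-squares loss 0.5 * (model . x - y)^2 w.r.t. model."""
    return (model @ x - y) * x


def worker_loop(ps, x, y, iterations=20):
    """One worker: overlap each push/pull with the next gradient computation."""
    model = ps.model.copy()
    with ThreadPoolExecutor(max_workers=1) as comm:
        pending = None                    # in-flight push/pull, if any
        for _ in range(iterations):
            grad = compute_gradient(model, x, y)   # compute on the local copy
            if pending is not None:
                model = pending.result()  # model pulled while we were computing
            # Start the next push/pull and keep computing without waiting.
            pending = comm.submit(ps.push_pull, grad)
        model = pending.result()
    return model


if __name__ == "__main__":
    ps = ParameterServer(dim=3)
    x, y = np.array([1.0, 2.0, 3.0]), 5.0
    print("final model:", worker_loop(ps, x, y))

In this toy setup the worker's per-iteration wait disappears because the gradient is computed while the previous round's push/pull is still in flight; the model used for computation is therefore at most one round stale.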

Original language: English
Title of host publication: Proceedings of the 48th International Conference on Parallel Processing, ICPP 2019
Publisher: Association for Computing Machinery
Pages: 1-10
ISBN (Electronic): 9781450362955
DOIs
Publication status: Published - 5 Aug 2019
Event: 48th International Conference on Parallel Processing, ICPP 2019 - Kyoto, Japan
Duration: 5 Aug 2019 - 8 Aug 2019

Publication series

Name: ACM International Conference Proceeding Series

Conference

Conference: 48th International Conference on Parallel Processing, ICPP 2019
Country/Territory: Japan
City: Kyoto
Period: 5/08/19 - 8/08/19

Keywords

  • Distributed
  • Machine learning
  • Parameter Server
  • SGD
  • Synchronization

ASJC Scopus subject areas

  • Human-Computer Interaction
  • Computer Networks and Communications
  • Computer Vision and Pattern Recognition
  • Software
