GSSP: Eliminating Stragglers through Grouping Synchronous for Distributed Deep Learning in Heterogeneous Cluster

Haifeng Sun, Zhiyi Gui, Song Guo, Qi Qi, Jingyu Wang, Jianxi Liao

Research output: Journal article publication › Journal article › Academic research › peer-review

2 Citations (Scopus)

Abstract

Distributed deep learning has been widely used in training deep neural networks, especially for big models on massive datasets. Parameter Server (PS) architecture is the most popular distributed training framework, and it allows the global parameter update scheme to be designed flexibly. However, when scaling to complex heterogeneous clusters, stragglers make it difficult for existing distributed paradigms on the PS framework to balance synchronous waiting against staleness, which slows down model training sharply. In this paper, we propose the Grouping Stale Synchronous Parallel (GSSP) scheme, which groups workers with similar performance together. Group servers coordinate intra-group workers using Stale Synchronous Parallel, while they communicate with each other asynchronously to eliminate stragglers and refine the model weights. We further propose Grouping Dynamic Top-K Sparsification (GDTopK), which dynamically adjusts the upload ratio for each group so as to differentiate communication volume and mitigate the inter-group iteration speed gap. We have conducted experiments with LeNet-5 on MNIST, ResNet-18 and VGG-19 on CIFAR-10, and Seq2Seq on Multi30k. Results show that GSSP accelerates training by 46% to 120%, with less than 1% accuracy drop. GDTopK can make up for part of the lost accuracy.
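
The three building blocks named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state how workers are grouped or how GDTopK adjusts its ratio, so the function names, the contiguous-slice grouping rule, and the fixed `ratio` argument are all assumptions made for illustration. The sketch shows grouping workers by measured iteration time, the standard Stale Synchronous Parallel bound applied inside a group, and plain top-k sparsification with a caller-supplied upload ratio.

```python
from typing import Dict, List


def group_by_speed(iter_times: Dict[str, float], num_groups: int) -> List[List[str]]:
    """Hypothetical grouping rule: rank workers by measured seconds per
    iteration and cut the ranking into contiguous slices, so each group
    holds workers of similar speed."""
    ranked = sorted(iter_times, key=iter_times.get)
    size = -(-len(ranked) // num_groups)  # ceiling division
    return [ranked[i:i + size] for i in range(0, len(ranked), size)]


def ssp_may_proceed(worker_clock: int, slowest_clock: int, staleness: int) -> bool:
    """Stale Synchronous Parallel bound: a worker may run ahead of the
    slowest worker in its group by at most `staleness` iterations."""
    return worker_clock - slowest_clock <= staleness


def top_k_sparsify(grad: List[float], ratio: float) -> Dict[int, float]:
    """Keep the `ratio` fraction of gradient entries with the largest
    magnitude (at least one) and return them as {index: value}.
    How GDTopK chooses `ratio` per group is not modeled here."""
    k = max(1, int(len(grad) * ratio))
    keep = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)[:k]
    return {i: grad[i] for i in sorted(keep)}


if __name__ == "__main__":
    # Made-up iteration times (seconds) for four workers.
    times = {"w0": 1.0, "w1": 1.1, "w2": 3.0, "w3": 3.2}
    print(group_by_speed(times, num_groups=2))
    print(ssp_may_proceed(worker_clock=5, slowest_clock=3, staleness=2))
    print(top_k_sparsify([0.1, -0.9, 0.3, 0.05], ratio=0.5))
```

In this sketch a slower group would simply be handed a smaller `ratio` so that it uploads fewer gradient entries per iteration. That matches the stated goal of differentiating communication volume across groups, but the adjustment policy itself would have to come from the paper.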

Original language: English
Pages (from-to): 2637 - 2648
Journal: IEEE Transactions on Cloud Computing
Publication status: Published - Feb 2021

Keywords

  • Computational modeling
  • Computer architecture
  • Data models
  • Deep learning
  • deep learning
  • distributed training
  • gradient compression
  • parameter server
  • Servers
  • Synchronization
  • top-k sparsification
  • Training

ASJC Scopus subject areas

  • Software
  • Information Systems
  • Hardware and Architecture
  • Computer Science Applications
  • Computer Networks and Communications
