Heterogeneity-aware gradient coding for straggler tolerance

Haozhao Wang, Song Guo, Bin Tang, Ruixuan Li, Chengjie Li

Research output: Unpublished conference presentation (presented paper, abstract, poster) › Conference presentation (not published in journal/proceeding/book) › Academic research › peer-review

5 Citations (Scopus)

Abstract

Gradient descent algorithms are widely used in machine learning. To handle huge volumes of data, we consider implementing gradient descent in a distributed computing setting, where multiple workers each compute the gradient over part of the data and a master node aggregates their results to obtain the gradient over the whole dataset. However, the performance of such a system can be severely degraded by straggler workers. Coding-based approaches have recently been introduced to mitigate the straggler problem, but they are efficient only when the workers are homogeneous, i.e., have the same computation capabilities. In this paper, we consider heterogeneous workers, which are common in modern distributed systems. We propose a novel heterogeneity-aware gradient coding scheme that not only tolerates a predetermined number of stragglers but also fully utilizes the computation capabilities of heterogeneous workers. We show that this scheme is optimal when the computation capabilities of the workers are estimated accurately. We further propose a variant of this scheme that improves performance when the estimates of the computation capabilities are less accurate. We evaluate our schemes on gradient-descent-based image classification on QingCloud clusters. The results show that our schemes reduce the total computation time by up to 3× compared with a state-of-the-art coding scheme.
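To make the idea of coding-based straggler tolerance concrete, the following is a minimal sketch of the well-known three-worker gradient coding example for the homogeneous case. It is not the heterogeneity-aware scheme proposed in this paper. The names `B`, `DECODE`, `encode`, and `decode` are hypothetical and chosen only for illustration. Each worker stores two of the three data partitions and sends one linear combination of its partial gradients; the master recovers the full gradient from any two workers, so one straggler is tolerated.

```python
from itertools import combinations

# Encoding coefficients for 3 workers, 3 data partitions, 1 tolerated straggler.
# Row i lists the weights worker i applies to the partial gradients g0, g1, g2;
# a zero weight means the worker does not store that partition.
B = [
    [0.5, 1.0, 0.0],   # worker 0 sends g0/2 + g1
    [0.0, 1.0, -1.0],  # worker 1 sends g1 - g2
    [0.5, 0.0, 1.0],   # worker 2 sends g0/2 + g2
]

# Decoding weights for every pair of workers that may finish first.
DECODE = {
    (0, 1): (2.0, -1.0),  # 2*(g0/2 + g1) - (g1 - g2)
    (0, 2): (1.0, 1.0),   # (g0/2 + g1) + (g0/2 + g2)
    (1, 2): (1.0, 2.0),   # (g1 - g2) + 2*(g0/2 + g2)
}


def encode(worker, partial_grads):
    """Return the coded vector that `worker` sends to the master."""
    dim = len(partial_grads[0])
    return [
        sum(B[worker][j] * partial_grads[j][d] for j in range(len(partial_grads)))
        for d in range(dim)
    ]


def decode(received):
    """Recover g0 + g1 + g2 from a dict {worker_id: coded_vector} of two workers."""
    ids = tuple(sorted(received))
    if ids not in DECODE:
        raise ValueError("need results from exactly two distinct workers")
    a, b = DECODE[ids]
    u, v = received[ids[0]], received[ids[1]]
    return [a * x + b * y for x, y in zip(u, v)]


# Illustrative partial gradients (made-up numbers), one per data partition.
grads = [[1.0, 2.0], [3.0, -1.0], [0.5, 4.0]]
full_gradient = [sum(col) for col in zip(*grads)]

# Whichever single worker straggles, the other two suffice.
for pair in combinations(range(3), 2):
    received = {w: encode(w, grads) for w in pair}
    print(pair, decode(received), full_gradient)
```

In this homogeneous code every worker processes the same number of partitions, so a slower machine holds up the result just as the abstract describes; the paper's contribution is a scheme that instead accounts for each worker's computation capability.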

Original language: English
Pages: 555-564
Number of pages: 10
Publication status: Published - Jul 2019
Event: 39th IEEE International Conference on Distributed Computing Systems, ICDCS 2019 - Richardson, United States
Duration: 7 Jul 2019 → 9 Jul 2019

Conference

Conference: 39th IEEE International Conference on Distributed Computing Systems, ICDCS 2019
Country/Territory: United States
City: Richardson
Period: 7/07/19 → 9/07/19

Keywords

  • Gradient coding
  • Heterogeneity-aware
  • Modern distributed system
  • Straggler tolerance

ASJC Scopus subject areas

  • Software
  • Hardware and Architecture
  • Computer Networks and Communications
