MIPD: An Adaptive Gradient Sparsification Framework for Distributed DNNs Training

Zhaorui Zhang, Cho-Li Wang

Research output: Journal article publication › Journal article › Academic research › peer-review

17 Citations (Scopus)

Abstract

Asynchronous training based on the parameter server architecture is widely used to scale DNN training over large datasets and large DNN models. Communication has been identified as the major bottleneck when deploying DNN training on large-scale distributed deep learning systems. Recent studies attempt to reduce communication traffic through gradient sparsification and quantization. We identify three limitations in previous studies. First, these works use gradient magnitude as the guideline for sparsification. However, gradient magnitude reflects only the current optimization direction and does not indicate the significance of the parameters, which can delay updates to significant parameters. Second, quantization methods applied to the entire model often accumulate errors during gradient aggregation, since gradients from different layers of the DNN model follow different distributions. Third, previous quantization approaches are CPU-intensive and impose heavy overhead on the server. We propose MIPD, an adaptive, layer-wise gradient sparsification framework that compresses gradients based on model interpretability and the probability distribution of the gradients. MIPD compresses each gradient according to the significance of its corresponding parameters, as defined by model interpretability. An exponential smoothing method is also proposed to compensate for dropped gradients on the server and reduce gradient error. MIPD updates half of the parameters in each training step to reduce the CPU overhead of the server, and it encodes gradients based on their probability distribution, thereby minimizing approximation errors. Extensive experiments on a GPU cluster show that the proposed framework improves DNN training performance by up to 36.2% while maintaining high accuracy compared to state-of-the-art solutions. Accordingly, the CPU and network usage of the server drop by up to 42.0% and 32.7%, respectively.
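To make the abstract's two core ideas concrete, the sketch below illustrates layer-wise sparsification keyed on a per-parameter significance score (rather than gradient magnitude) and server-side exponential-smoothing compensation for the dropped coordinates. This is a minimal illustration, not the authors' implementation: the function names, the fixed keep ratio, the smoothing factor `alpha`, and the random significance scores standing in for an interpretability measure are all assumptions made for the example.

```python
# Illustrative sketch (not the paper's code): layer-wise top-k sparsification ranked by a
# significance score, plus an exponentially smoothed server-side estimate that fills in
# the coordinates a worker dropped before aggregation.
import numpy as np

def sparsify_layer(grad, significance, keep_ratio=0.01):
    """Keep only the top `keep_ratio` fraction of this layer's gradient entries,
    ranked by the supplied significance score rather than by gradient magnitude."""
    flat_grad = grad.ravel()
    flat_sig = significance.ravel()
    k = max(1, int(keep_ratio * flat_grad.size))
    idx = np.argpartition(-flat_sig, k - 1)[:k]   # indices of the k most significant entries
    return idx, flat_grad[idx]

class SmoothedLayerState:
    """Server-side state: an exponentially smoothed estimate of one layer's gradient,
    used to compensate for the coordinates the worker did not send."""
    def __init__(self, shape, alpha=0.9):
        self.alpha = alpha                         # smoothing factor (assumed value)
        self.shape = shape
        self.estimate = np.zeros(int(np.prod(shape)))

    def aggregate(self, idx, values):
        # Start from the smoothed estimate so dropped coordinates are compensated,
        # then overwrite the coordinates the worker actually transmitted.
        dense = self.estimate.copy()
        dense[idx] = values
        # Fold the reconstructed gradient back into the smoothed estimate.
        self.estimate = self.alpha * self.estimate + (1.0 - self.alpha) * dense
        return dense.reshape(self.shape)

# Toy usage: one layer, one worker, one training step.
rng = np.random.default_rng(0)
grad = rng.normal(size=(256, 128))
significance = rng.uniform(size=(256, 128))        # stand-in for an interpretability score
idx, vals = sparsify_layer(grad, significance, keep_ratio=0.01)
state = SmoothedLayerState(grad.shape)
reconstructed = state.aggregate(idx, vals)
```

In this toy setup each layer keeps its own `SmoothedLayerState`, which mirrors the abstract's layer-wise treatment: compression decisions and error compensation are made per layer rather than over the whole model.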

Original language: English
Pages (from-to): 3053-3066
Number of pages: 14
Journal: IEEE Transactions on Parallel and Distributed Systems
Volume: 33
Issue number: 11
DOIs
Publication status: Published - 1 Nov 2022

Keywords

  • Exponential smoothing prediction
  • gradients sparsification
  • model interpretability
  • probability distribution
  • quantization

ASJC Scopus subject areas

  • Signal Processing
  • Hardware and Architecture
  • Computational Theory and Mathematics
