Sign Bit is Enough: A Learning Synchronization Framework for Multi-hop All-reduce with Ultimate Compression

  • Feijie Wu
  • , Shiqi He
  • , Song Guo
  • , Zhihao Qu
  • , Haozhao Wang
  • , Weihua Zhuang
  • , Jie Zhang

Research output: Chapter in book / Conference proceedingConference article published in proceeding or bookAcademic researchpeer-review

8 Citations (Scopus)

Abstract

Traditional one-bit compressed stochastic gradient descent can not be directly employed in multi-hop all-reduce, a widely adopted distributed training paradigm in network-intensive high-performance computing systems such as public clouds. According to our theoretical findings, due to the cascading compression, the training process has considerable deterioration on the convergence performance. To overcome this limitation, we implement a sign-bit compression-based learning synchronization framework, Marsit. It prevents cascading compression via an elaborate bit-wise operation for unbiased sign aggregation and its specific global compensation mechanism for mitigating compression deviation. The proposed framework retains the same theoretical convergence rate as non-compression mechanisms. Experimental results demonstrate that Marsit reduces up to 35% training time while preserving the same accuracy as training without compression.

Original languageEnglish
Title of host publicationProceedings of the 59th ACM/IEEE Design Automation Conference, DAC 2022
PublisherInstitute of Electrical and Electronics Engineers Inc.
Pages193-198
Number of pages6
ISBN (Electronic)9781450391429
DOIs
Publication statusPublished - 10 Jul 2022
Event59th ACM/IEEE Design Automation Conference, DAC 2022 - San Francisco, United States
Duration: 10 Jul 202214 Jul 2022

Publication series

NameProceedings - Design Automation Conference
ISSN (Print)0738-100X

Conference

Conference59th ACM/IEEE Design Automation Conference, DAC 2022
Country/TerritoryUnited States
CitySan Francisco
Period10/07/2214/07/22

Keywords

  • distributed machine learning
  • multi-hop all-reduce
  • signSGD

ASJC Scopus subject areas

  • Computer Science Applications
  • Control and Systems Engineering
  • Electrical and Electronic Engineering
  • Modelling and Simulation

Fingerprint

Dive into the research topics of 'Sign Bit is Enough: A Learning Synchronization Framework for Multi-hop All-reduce with Ultimate Compression'. Together they form a unique fingerprint.

Cite this