Topology-aware job scheduling for machine learning cluster

Jingyuan Lu, Peng Li, Kun Wang, Huibin Feng, Enting Guo, Xiaoyan Wang, Song Guo

Research output: Unpublished conference presentation (presented paper, abstract, poster) > Conference presentation (not published in journal/proceeding/book) > Academic research > peer-review

Abstract

Parameter Server (PS) has been widely used to train models on large amounts of data across multiple machines in parallel. In a parameter server, a critical problem is how to schedule multiple training jobs effectively so as to minimize job completion time. Some existing work has proposed methods for setting the number of concurrent workers. However, these methods do not effectively consider the topology of GPU placement, which affects communication efficiency. This paper proposes a novel resource-to-time model based on the number of workers and the topology of GPU placement. Based on this model, we propose an algorithm called TOPO-PS that specifically addresses the topology problem in parameter servers. The algorithm derives its placement strategy using a graph mapping algorithm. Evaluation against various algorithms demonstrates the superiority of our algorithm. TOPO-PS shortens job completion time by up to 53.48% of that of FIFO and 88.77% of that of OASIS.
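
To illustrate why GPU placement topology matters for communication cost, the following is a minimal sketch only. It is not TOPO-PS and not the paper's resource-to-time model: the cost function `link_cost`, the assumption of two GPUs per machine, and the brute-force search are all hypothetical choices made for this example.

```python
from itertools import combinations


def link_cost(a, b):
    """Assumed cost model (hypothetical): GPUs are numbered so that
    IDs 2k and 2k+1 sit on the same machine. Same-machine pairs cost 1,
    cross-machine pairs cost 4."""
    return 1 if a // 2 == b // 2 else 4


def placement_cost(gpus):
    """Sum of pairwise link costs over a candidate set of GPUs."""
    return sum(link_cost(a, b) for a, b in combinations(gpus, 2))


def best_placement(free_gpus, num_workers):
    """Brute-force search for the cheapest set of num_workers free GPUs.
    Returns None when there are not enough free GPUs."""
    if num_workers > len(free_gpus):
        return None
    candidates = combinations(sorted(free_gpus), num_workers)
    return min(candidates, key=placement_cost)


if __name__ == "__main__":
    # GPUs 2 and 3 share a machine, so they form the cheapest pair.
    print(best_placement([0, 2, 3, 5], 2))
```

Exhaustive search grows combinatorially with the number of GPUs, so it is only practical for toy inputs. The abstract states that TOPO-PS instead derives placements with a graph mapping algorithm, which this sketch does not reproduce.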

Original language: English
Publication status: Published - Dec 2019
Externally published: Yes
Event: 2019 IEEE Global Communications Conference, GLOBECOM 2019 - Waikoloa, United States
Duration: 9 Dec 2019 to 13 Dec 2019

Conference

Conference: 2019 IEEE Global Communications Conference, GLOBECOM 2019
Country: United States
City: Waikoloa
Period: 9/12/19 to 13/12/19

Keywords

  • Cloud Computing
  • Machine Learning
  • Parameter Server
  • Scheduling Algorithms

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Hardware and Architecture
  • Information Systems
  • Signal Processing
  • Information Systems and Management
  • Safety, Risk, Reliability and Quality
  • Media Technology
  • Health Informatics
