Software-Defined Data Shuffling for Big Data Jobs with Task Duplication

Qimeng Zang, Hsiang Yu Chan, Peng Li, Song Guo

Research output: Chapter in book / Conference proceeding › Conference article published in proceeding or book › Academic research › peer-review


Big data jobs are usually executed on large-scale distributed computing platforms that automatically divide a job into multiple computation phases, each of which contains a number of independent tasks that can run in parallel. The data shuffling process between two consecutive phases often becomes the bottleneck of job execution. To improve its performance, a 'push' shuffling approach has been proposed that sends intermediate results to the next phase immediately once they are generated. It avoids the local disk accesses of the traditional 'pull' shuffling approach, and tasks in the next phase can start data processing without waiting for tasks in the preceding phase to finish. Task duplication is another approach to accelerating task execution: multiple task copies are launched that compete to process the same data block. When 'push' shuffling meets task duplication, big data jobs can be significantly accelerated, but a large amount of redundant data is transmitted between the two phases. To address this challenge, we propose a software-defined data shuffling approach by designing a controller and a janitor module to control the data shuffling process. Each task has a janitor that communicates with the controller to request an admission permit before sending intermediate results to next-stage tasks. We further propose an online grouping algorithm to reduce the overhead of frequent communication with the controller. The performance of the proposed approach is evaluated by extensive simulations.
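The controller/janitor admission scheme described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation; the class and method names (`ShuffleController`, `Janitor`, `request_permit`, `try_push`) are hypothetical, and it assumes the simplest policy: the controller grants at most one push permit per intermediate data block, so only the first of the competing duplicate task copies transmits it.

```python
# Illustrative sketch (assumed design, not the paper's implementation):
# a centralized controller grants at most one 'push' permit per
# intermediate data block, suppressing redundant transmissions from
# duplicate task copies.
class ShuffleController:
    def __init__(self):
        self._granted = set()  # block IDs already permitted for pushing

    def request_permit(self, block_id):
        """Grant the permit only to the first janitor asking for block_id."""
        if block_id in self._granted:
            return False  # a duplicate copy already pushed this block
        self._granted.add(block_id)
        return True


class Janitor:
    """Per-task module that asks the controller before pushing results."""
    def __init__(self, task_id, controller):
        self.task_id = task_id
        self.controller = controller

    def try_push(self, block_id):
        if self.controller.request_permit(block_id):
            return f"task {self.task_id} pushes block {block_id}"
        return f"task {self.task_id} drops redundant block {block_id}"


controller = ShuffleController()
copy_a = Janitor("A", controller)
copy_b = Janitor("B", controller)  # duplicate copy of the same task
print(copy_a.try_push("b1"))  # first copy is admitted
print(copy_b.try_push("b1"))  # duplicate transmission is suppressed
```

In a real deployment each `request_permit` call would be a network round trip; the paper's online grouping algorithm reduces that overhead by batching requests, which a sketch like this could mimic by having janitors request permits for a group of blocks at once.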
Original language: English
Title of host publication: Proceedings - 45th International Conference on Parallel Processing Workshops, ICPPW 2016
Number of pages: 5
ISBN (Electronic): 9781509028252
Publication status: Published - 23 Sept 2016
Externally published: Yes
Event: 45th International Conference on Parallel Processing Workshops, ICPPW 2016 - Philadelphia, United States
Duration: 16 Aug 2016 - 19 Aug 2016


Conference: 45th International Conference on Parallel Processing Workshops, ICPPW 2016
Country/Territory: United States


Keywords

  • MapReduce
  • shuffling
  • task duplication
  • traffic

ASJC Scopus subject areas

  • Software
  • General Mathematics
  • Hardware and Architecture


