Abstract
Big data jobs are usually executed on large-scale distributed computing platforms that automatically divide a job into multiple computation phases, each of which contains a number of independent tasks that can run in parallel. The data shuffling process between two consecutive phases becomes the bottleneck of job execution. To improve its performance, 'push' shuffling has been proposed, which sends intermediate results to the next phase as soon as they are generated. It avoids the local disk accesses of the traditional 'pull' shuffling approach, and tasks in the next phase can start processing data without waiting for tasks in the preceding phase to finish. Task duplication is another approach that accelerates task execution by launching multiple task copies that compete to process the same data block. When 'push' shuffling meets task duplication, big data jobs can be significantly accelerated, but at the cost of a large amount of redundant data transmission between two phases. To address this challenge, we propose a software-defined data shuffling approach that uses a controller and a janitor module to control the data shuffling process. Each task has a janitor that communicates with the controller to request permission to send intermediate results to next-stage tasks. We further propose an online grouping algorithm to reduce the overhead of frequent communication with the controller. The performance of the proposed algorithm is evaluated by extensive simulations.
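The controller and janitor interaction described above can be illustrated with a minimal sketch. The abstract does not specify the admission policy, so this sketch assumes the simplest one: the first task copy to request a permit for a data block is admitted, and all other duplicates of that block are refused. The class names `Controller` and `Janitor`, the methods `request_permit` and `push`, and the block and copy identifiers are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Controller:
    """Hypothetical controller: admits one task copy per data block."""
    admitted: dict = field(default_factory=dict)  # block_id -> copy_id

    def request_permit(self, block_id: str, copy_id: int) -> bool:
        # Assumed policy: the first copy to ask for a block is admitted
        # and keeps its permit; every other duplicate is refused.
        owner = self.admitted.setdefault(block_id, copy_id)
        return owner == copy_id


@dataclass
class Janitor:
    """Hypothetical per-task janitor that gates 'push' shuffling."""
    controller: Controller
    block_id: str
    copy_id: int
    sent: list = field(default_factory=list)

    def push(self, record) -> bool:
        # Ask the controller before transmitting an intermediate result.
        if self.controller.request_permit(self.block_id, self.copy_id):
            self.sent.append(record)  # stands in for a network send
            return True
        return False  # redundant copy: drop instead of transmitting


controller = Controller()
original = Janitor(controller, "block-0", copy_id=0)
duplicate = Janitor(controller, "block-0", copy_id=1)
original.push(("key", 1))   # admitted, so the record is "sent"
duplicate.push(("key", 1))  # refused, so nothing is transmitted
```

In this sketch the janitor contacts the controller once per record, which is the kind of frequent communication the online grouping algorithm is meant to reduce; grouping itself is not modelled here.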
Original language | English |
---|---|
Title of host publication | Proceedings - 45th International Conference on Parallel Processing Workshops, ICPPW 2016 |
Publisher | IEEE |
Pages | 403-407 |
Number of pages | 5 |
Volume | 2016-September |
ISBN (Electronic) | 9781509028252 |
Publication status | Published - 23 Sept 2016 |
Externally published | Yes |
Event | 45th International Conference on Parallel Processing Workshops, ICPPW 2016 - Philadelphia, United States. Duration: 16 Aug 2016 → 19 Aug 2016 |
Conference
Conference | 45th International Conference on Parallel Processing Workshops, ICPPW 2016 |
---|---|
Country/Territory | United States |
City | Philadelphia |
Period | 16/08/16 → 19/08/16 |
Keywords
- MapReduce
- shuffling
- task duplication
- traffic
ASJC Scopus subject areas
- Software
- General Mathematics
- Hardware and Architecture