Online load balancing for MapReduce with skewed data input

Yanfang Le, Jiangchuan Liu, Funda Ergün, Dan Wang

Research output: Chapter in book / Conference proceedingConference article published in proceeding or bookAcademic researchpeer-review

54 Citations (Scopus)

Abstract

MapReduce has emerged as a powerful tool for distributed and scalable processing of voluminous data. In this paper, we, for the first time, examine the problem of accommodating data skew in MapReduce with online operations. Different from earlier heuristics in the very late reduce stage or after seeing all the data, we address the skew from the beginning of data input, and make no assumption about a priori knowledge of the data distribution nor require synchronized operations. We examine the input in a continuous fashion and adaptively assign tasks with a load-balanced strategy. We show that the optimal strategy is a constrained version of online minimum makespan and, in the MapReduce context where pairs with identical keys must be scheduled to the same machine, there is an online algorithm with a provable 2-competitive ratio. We further suggest a sample-based enhancement, which, probabilistically, achieves a 3/2-competitive ratio with a bounded error.
Original languageEnglish
Title of host publicationIEEE INFOCOM 2014 - IEEE Conference on Computer Communications
PublisherIEEE
Pages2004-2012
Number of pages9
ISBN (Print)9781479933600
DOIs
Publication statusPublished - 1 Jan 2014
Event33rd IEEE Conference on Computer Communications, IEEE INFOCOM 2014 - Toronto, ON, Canada
Duration: 27 Apr 20142 May 2014

Conference

Conference33rd IEEE Conference on Computer Communications, IEEE INFOCOM 2014
Country/TerritoryCanada
CityToronto, ON
Period27/04/142/05/14

ASJC Scopus subject areas

  • General Computer Science
  • Electrical and Electronic Engineering

Fingerprint

Dive into the research topics of 'Online load balancing for MapReduce with skewed data input'. Together they form a unique fingerprint.

Cite this