Run Data: Re-distributing data via piggybacking for geo-distributed data analytics over edges

Yibo Jin, Zhuzhong Qian, Song Guo, Sheng Zhang, Lei Jiao, Sanglu Lu

Research output: Journal article publication > Journal article > Academic research > peer-review

1 Citation (Scopus)


Efficiently analyzing geo-distributed datasets is emerging as a major demand in cloud-edge systems. Since datasets are often generated in close proximity to end users, traditional works mainly focus on offloading proper tasks from hotspot edges to the datacenter to decrease the overall completion time of submitted jobs in a one-shot manner. However, optimizing the completion time of the current job alone is insufficient in the long term, since some datasets are used multiple times. Instead, optimizing the data distribution is much more efficient and can directly benefit forthcoming jobs, although it may postpone the execution of the current one. Unfortunately, due to the throwaway feature of the data fetcher, existing data analytics systems fail to re-distribute the corresponding data out of hotspot edges after the execution of data analytics. To minimize the overall completion time for a sequence of jobs while guaranteeing the performance of the current one, we propose to re-distribute the data along with task offloading, and formulate the corresponding ϵ-bounded data-driven task scheduling problem over a wide area network, taking edge heterogeneity into account. We design an online schema, runData, which offloads proper tasks and related data via piggybacking to the datacenter based on carefully calculated probabilities. Through rigorous theoretical analysis, runData is proved to concentrate around its optimum with high probability. We implement runData on top of Spark and HDFS. Both testbed results and trace-driven simulations show that runData re-distributes proper data via piggybacking and achieves up to a 37 percent reduction in average response time compared with state-of-the-art schemas.

Original language: English
Article number: 9446574
Pages (from-to): 40-55
Number of pages: 16
Journal: IEEE Transactions on Parallel and Distributed Systems
Issue number: 1
Publication status: Published - 1 Jan 2022


Keywords

  • Cloud-edge system
  • Data re-distribution
  • Heterogeneity
  • Online schema

ASJC Scopus subject areas

  • Signal Processing
  • Hardware and Architecture
  • Computational Theory and Mathematics

