TY - JOUR
T1 - Run Data: Re-distributing data via piggybacking for geo-distributed data analytics over edges
AU - Jin, Yibo
AU - Qian, Zhuzhong
AU - Guo, Song
AU - Zhang, Sheng
AU - Jiao, Lei
AU - Lu, Sanglu
N1 - Funding Information:
This work was supported in part by the National Key R&D Program of China under Grant 2017YFB1001801, in part by the National Science Foundation of China under Grants 61832005 and 61872175, in part by the Ripple Faculty Fellowship, in part by the Natural Science Foundation of Jiangsu Province under Grant BK20181252, in part by the Collaborative Innovation Center of Novel Software Technology and Industrialization, in part by the Nanjing University Innovation and Creative Program for PhD under Grant CXCY19-25, in part by the Hong Kong RGC Research Impact Fund (RIF) through Projects R5060-19 and R5034-18, in part by the General Research Fund (GRF) through Projects 152221/19E and 15220320/20E, in part by the Collaborative Research Fund (CRF) through Project C5026-18G, in part by the National Natural Science Foundation of China under Grant 61872310, and in part by the Shenzhen Science and Technology Innovation Commission under Grant R2020A045. The preliminary version of this work, entitled "Run Data Run! Re-distributing Data via Piggybacking for Geo-distributed Data Analytics", was presented in part at IEEE ISPA 2019 [1].
Publisher Copyright:
© 2021 IEEE.
PY - 2022/1/1
Y1 - 2022/1/1
N2 - Efficiently analyzing geo-distributed datasets is emerging as a major demand in cloud-edge systems. Since the datasets are often generated in close proximity to end users, traditional works mainly focus on offloading appropriate tasks from hotspot edges to the datacenter to decrease the overall completion time of submitted jobs in a one-shot manner. However, optimizing the completion time of the current job alone is insufficient in the long term, since some datasets are used multiple times. Instead, optimizing the data distribution is much more efficient and can directly benefit forthcoming jobs, although it may postpone the execution of the current one. Unfortunately, due to the throwaway nature of the data fetcher, existing data analytics systems fail to re-distribute the corresponding data out of hotspot edges after the analytics have executed. To minimize the overall completion time for a sequence of jobs while guaranteeing the performance of the current one, we propose to re-distribute data along with task offloading, and formulate the corresponding ϵ-bounded data-driven task scheduling problem over the wide area network, taking edge heterogeneity into consideration. We design an online scheme, runData, which offloads appropriate tasks and related data via piggybacking to the datacenter based on carefully calculated probabilities. Through rigorous theoretical analysis, runData is proved to concentrate on its optimum with high probability. We implement runData on top of Spark and HDFS. Both testbed results and trace-driven simulations show that runData re-distributes appropriate data via piggybacking and achieves up to a 37 percent reduction in average response time compared with state-of-the-art schemes.
AB - Efficiently analyzing geo-distributed datasets is emerging as a major demand in cloud-edge systems. Since the datasets are often generated in close proximity to end users, traditional works mainly focus on offloading appropriate tasks from hotspot edges to the datacenter to decrease the overall completion time of submitted jobs in a one-shot manner. However, optimizing the completion time of the current job alone is insufficient in the long term, since some datasets are used multiple times. Instead, optimizing the data distribution is much more efficient and can directly benefit forthcoming jobs, although it may postpone the execution of the current one. Unfortunately, due to the throwaway nature of the data fetcher, existing data analytics systems fail to re-distribute the corresponding data out of hotspot edges after the analytics have executed. To minimize the overall completion time for a sequence of jobs while guaranteeing the performance of the current one, we propose to re-distribute data along with task offloading, and formulate the corresponding ϵ-bounded data-driven task scheduling problem over the wide area network, taking edge heterogeneity into consideration. We design an online scheme, runData, which offloads appropriate tasks and related data via piggybacking to the datacenter based on carefully calculated probabilities. Through rigorous theoretical analysis, runData is proved to concentrate on its optimum with high probability. We implement runData on top of Spark and HDFS. Both testbed results and trace-driven simulations show that runData re-distributes appropriate data via piggybacking and achieves up to a 37 percent reduction in average response time compared with state-of-the-art schemes.
KW - Cloud-edge system
KW - Data re-distribution
KW - Heterogeneity
KW - Online scheme
UR - http://www.scopus.com/inward/record.url?scp=85107384326&partnerID=8YFLogxK
U2 - 10.1109/TPDS.2021.3086274
DO - 10.1109/TPDS.2021.3086274
M3 - Journal article
AN - SCOPUS:85107384326
SN - 1045-9219
VL - 33
SP - 40
EP - 55
JO - IEEE Transactions on Parallel and Distributed Systems
JF - IEEE Transactions on Parallel and Distributed Systems
IS - 1
M1 - 9446574
ER -