TY - GEN
T1 - ran-GJS
T2 - 47th International Conference on Parallel Processing, ICPP 2018
AU - Jin, Yibo
AU - Zhang, Sheng
AU - Qian, Zhuzhong
AU - Wang, Xiaoliang
AU - Guo, Song
AU - Lu, Sanglu
PY - 2018/8/13
Y1 - 2018/8/13
N2 - Many organizations and companies have deployed not only datacenters but also a large number of geo-distributed heterogeneous edges to provide fast data analytics services. Since transmitting large volumes of data across the WAN can be costly, existing work mainly focuses on pre-processing data in place to avoid transmission. However, the heterogeneity of edges in local computing capacity and network bandwidth limits the efficient use of scarce resources, which may result in long task completion times. To cope with dynamic demands on scarce resources, we take the heterogeneity of both computing capacity and network bandwidth of geo-distributed edges into consideration when assigning data analytical tasks and their associated data between the central datacenter and edges, such that the overall latency can be reduced. We formulate the geo-distributed data-task joint scheduling problem (GJS), show its NP-hardness, and propose a near-optimal randomized scheduling algorithm (ran-GJS). Using martingale analysis, we prove that ran-GJS is concentrated around its optimum value with high probability, i.e., 1-O(e^{-t^2}), where t is the concentration bound. The experimental results obtained from both extensive simulations and a Yarn-based prototype show that ran-GJS significantly speeds up geo-distributed analytics, with a gain in average completion time of at least 28% over state-of-the-art baseline algorithms.
AB - Many organizations and companies have deployed not only datacenters but also a large number of geo-distributed heterogeneous edges to provide fast data analytics services. Since transmitting large volumes of data across the WAN can be costly, existing work mainly focuses on pre-processing data in place to avoid transmission. However, the heterogeneity of edges in local computing capacity and network bandwidth limits the efficient use of scarce resources, which may result in long task completion times. To cope with dynamic demands on scarce resources, we take the heterogeneity of both computing capacity and network bandwidth of geo-distributed edges into consideration when assigning data analytical tasks and their associated data between the central datacenter and edges, such that the overall latency can be reduced. We formulate the geo-distributed data-task joint scheduling problem (GJS), show its NP-hardness, and propose a near-optimal randomized scheduling algorithm (ran-GJS). Using martingale analysis, we prove that ran-GJS is concentrated around its optimum value with high probability, i.e., 1-O(e^{-t^2}), where t is the concentration bound. The experimental results obtained from both extensive simulations and a Yarn-based prototype show that ran-GJS significantly speeds up geo-distributed analytics, with a gain in average completion time of at least 28% over state-of-the-art baseline algorithms.
UR - http://www.scopus.com/inward/record.url?scp=85054878051&partnerID=8YFLogxK
U2 - 10.1145/3225058.3225059
DO - 10.1145/3225058.3225059
M3 - Conference article published in proceeding or book
AN - SCOPUS:85054878051
SN - 9781450365109
T3 - ACM International Conference Proceeding Series
BT - Proceedings of the 47th International Conference on Parallel Processing, ICPP 2018
PB - Association for Computing Machinery
Y2 - 14 August 2018 through 16 August 2018
ER -