TY - GEN
T1 - GHive: Accelerating Analytical Query Processing in Apache Hive via CPU-GPU Heterogeneous Computing
AU - Liu, Haotian
AU - Tang, Bo
AU - Zhang, Jiashu
AU - Deng, Yangshen
AU - Yan, Xiao
AU - Zheng, Xinying
AU - Shen, Qiaomu
AU - Zeng, Dan
AU - Mao, Zunyao
AU - Zhang, Chaozu
AU - You, Zhengxin
AU - Wang, Zhihao
AU - Jiang, Runzhe
AU - Wang, Fang
AU - Yiu, Man Lung
AU - Li, Huan
AU - Han, Mingji
AU - Li, Qian
AU - Luo, Zhenghai
N1 - Funding Information:
We are grateful to the anonymous reviewers and the shepherd Krishna Kantikiran Pasupuleti for their constructive comments and insightful suggestions on this paper. This work was supported by the Guangdong Provincial Key Laboratory (2020B121201001), Guangdong Basic and Applied Basic Research Foundation (2021A1515110067) and a research gift from Huawei. Dr. Bo Tang and Dr. Xiao Yan are the corresponding authors.
Publisher Copyright:
© 2022 ACM.
PY - 2022/11/7
Y1 - 2022/11/7
AB - As a popular distributed data warehouse system, Apache Hive has been widely used for big data analytics in many organizations. Meanwhile, exploiting the massive parallelism of GPUs to accelerate online analytical processing (OLAP) has been extensively explored in the database community. In this paper, we present GHive, which enhances CPU-based Hive via CPU-GPU heterogeneous computing. GHive is designed for business intelligence applications and provides the same API as Hive for compatibility. To run SQL queries jointly on both CPU and GPU, GHive comes with three key techniques: (i) a novel data model, gTable, which is column-based and enables efficient data movement between CPU memory and GPU memory; (ii) a GPU-based operator library, Panda, which provides a complete set of SQL operators with extensively optimized GPU implementations; (iii) a hardware-aware MapReduce job placement scheme, which places each job on either the GPU or the CPU via a cost-based approach. In the experiments, we observe that GHive outperforms Hive in both query processing speed and operating expense on the Star Schema Benchmark (SSB).
UR - http://www.scopus.com/inward/record.url?scp=85143254107&partnerID=8YFLogxK
U2 - 10.1145/3542929.3563503
DO - 10.1145/3542929.3563503
M3 - Conference article published in proceeding or book
AN - SCOPUS:85143254107
T3 - SoCC 2022 - Proceedings of the 13th Symposium on Cloud Computing
SP - 158
EP - 172
BT - SoCC 2022 - Proceedings of the 13th Symposium on Cloud Computing
PB - Association for Computing Machinery, Inc
T2 - 13th Annual ACM Symposium on Cloud Computing, SoCC 2022
Y2 - 7 November 2022 through 11 November 2022
ER -