TY - JOUR
T1 - A Comprehensive Survey on Training Acceleration for Large Machine Learning Models in IoT
AU - Wang, Haozhao
AU - Qu, Zhihao
AU - Zhou, Qihua
AU - Zhang, Haobo
AU - Luo, Boyuan
AU - Xu, Wenchao
AU - Guo, Song
AU - Li, Ruixuan
N1 - Funding Information:
This work was supported in part by the National Key Research and Development Program of China under Grant 2016YFB0800402; in part by the Hong Kong RGC Research Impact Fund (RIF) under Project R5060-19; in part by the General Research Fund (GRF) under Project 152221/19E and Project 15220320/20E; in part by the Collaborative Research Fund (CRF) under Project C5026-18G; in part by the National Natural Science Foundation of China under Grant 61872310, Grant U1836204, and Grant U1936108; in part by the Major Projects of the National Social Science Foundation under Grant 16ZDA092; in part by Shenzhen Science and Technology Innovation Commission under Grant R2020A045; in part by the Fundamental Research Funds for the Central Universities under Grant 200202176 and Grant 210202079; in part by the China Postdoctoral Science Foundation under Grant 2019M661709; and in part by the Research Grants Council of the Hong Kong Special Administrative Region, China, under Project PolyU15222621.
Publisher Copyright:
© 2014 IEEE.
PY - 2022/1/15
Y1 - 2022/1/15
AB - Ever-growing artificial intelligence (AI) applications have greatly reshaped our world in many areas, e.g., smart homes, computer vision, and natural language processing. Behind these applications usually lie machine learning (ML) models of extremely large size, which require huge data sets for accurate training in order to mine the value contained in big data. Large ML models, however, consume tremendous computing resources to achieve decent performance, making them difficult to train in resource-constrained Internet of Things (IoT) environments and thereby hindering the further development and application of AI techniques. To address these challenges, many efforts have been devoted to accelerating the training of large ML models in IoT. In this article, we provide a comprehensive review of recent advances in reducing the computing cost of the training stage while maintaining comparable model accuracy. Specifically, we emphasize optimization algorithms that aim to improve the convergence rate under various distributed learning architectures that exploit ubiquitous computing resources. The article then elaborates on hardware-aided computation acceleration and communication optimization for collaborative training among multiple learning entities. Finally, the remaining challenges, future opportunities, and possible research directions are discussed.
KW - Distributed machine learning (ML)
KW - hardware-aided acceleration
KW - large model training
KW - training acceleration
UR - http://www.scopus.com/inward/record.url?scp=85114719754&partnerID=8YFLogxK
U2 - 10.1109/JIOT.2021.3111624
DO - 10.1109/JIOT.2021.3111624
M3 - Journal article
AN - SCOPUS:85114719754
SN - 2327-4662
VL - 9
SP - 939
EP - 963
JO - IEEE Internet of Things Journal
JF - IEEE Internet of Things Journal
IS - 2
ER -