TY - GEN
T1 - Versatile Datapath Soft Error Detection on the Cheap for HPC Applications
AU - Huang, Yafan
AU - Di, Sheng
AU - Zhang, Zhaorui
AU - Lu, Xiaoyi
AU - Li, Guanpeng
N1 - Publisher Copyright: © 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - With the ongoing reduction in technology sizes and voltage levels, modern microprocessors are increasingly susceptible to soft errors that corrupt datapath units during program execution. While these errors have received considerable attention recently, existing solutions either confine themselves to limited scopes or incur massive overheads in performance and power consumption, hindering practical usage. In this work, we propose ConDa, a novel error detection technique based on code transformation and static program analysis that achieves versatile datapath protection at low cost. At compile time, ConDa analyzes program characteristics and transforms the original program code without complicating its control flow and memory access patterns. At runtime, ConDa detects datapath errors with low overhead and latency. An evaluation on 38 benchmarks and a parallel HPC simulation shows that ConDa incurs only 57.79% runtime overhead, which is 41.84% faster than the existing state of the art, while providing the same level of error detection effectiveness and low detection latency.
KW - Code Transformation
KW - Compiler
KW - Datapath Protection
KW - High-Performance Computing (HPC)
KW - Reliability
KW - Soft Errors
UR - https://www.scopus.com/pages/publications/85214989257
U2 - 10.1109/SC41406.2024.00061
DO - 10.1109/SC41406.2024.00061
M3 - Conference article published in proceeding or book
AN - SCOPUS:85214989257
T3 - International Conference for High Performance Computing, Networking, Storage and Analysis, SC
BT - Proceedings of SC 2024
PB - IEEE Computer Society
T2 - 2024 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2024
Y2 - 17 November 2024 through 22 November 2024
ER -