Authors: Hengshan Yue and Xiaohui Wei (Jilin University, China); Guangli Li (Institute of Computing Technology, Chinese Academy of Sciences); and Jianpeng Zhao, Nan Jiang, and Jingweijia Tan (Jilin University, China)
Abstract: As GPUs become ubiquitous in large-scale HPC systems, ensuring reliable execution in the presence of soft errors is increasingly essential. To assess the resilience of GPGPU programs to soft errors, researchers rely on random Fault Injection (FI). However, random FI is prohibitively expensive when a statistically significant resilience profile is required, and it is not suited to identifying all the critical bits of a GPGPU program.
To address these challenges, we build a GPGPU-based Soft Error Prediction Model (G-SEPM) to estimate fault site resiliency. We observe that instruction type, bit position, bit-flip direction, and the error propagation chain can characterize fault site resiliency. Leveraging these heuristic features, G-SEPM uses a machine learning model to reveal the hidden interactions between fault site resiliency and the proposed features. Experimental results demonstrate that G-SEPM achieves high accuracy for fault site error estimation and critical bit identification while introducing negligible overhead.
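To make the fault-site features named in the abstract concrete, the following minimal sketch models a single-bit flip in a 32-bit register value and derives three of the four features (instruction type, bit position, bit-flip direction). It is an illustration only: the `FaultSite` class, the field names, the opcode labels, and the dictionary encoding are hypothetical and are not taken from G-SEPM. The error propagation chain is omitted because the abstract does not describe how it is represented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultSite:
    """Hypothetical description of one fault site (not G-SEPM's own format)."""
    opcode: str  # instruction type that produced the value (illustrative label)
    value: int   # 32-bit destination register value before the fault
    bit: int     # bit position to flip, 0 = least significant bit


def inject(site: FaultSite) -> int:
    """Return the register value after flipping exactly one bit."""
    return (site.value ^ (1 << site.bit)) & 0xFFFFFFFF


def features(site: FaultSite) -> dict:
    """Derive illustrative features for a fault site.

    The flip direction depends on the bit's original value:
    a set bit can only flip 1->0, a cleared bit only 0->1.
    """
    original_bit = (site.value >> site.bit) & 1
    return {
        "opcode": site.opcode,
        "bit_position": site.bit,
        "flip_direction": "1->0" if original_bit else "0->1",
    }


# Example: flip bit 1 of the value 0b1010 produced by a hypothetical integer add.
site = FaultSite(opcode="IADD", value=0b1010, bit=1)
corrupted = inject(site)
feats = features(site)
```

In a setup like this, feature dictionaries for a sample of fault sites, labeled with the outcome observed under fault injection, could serve as training data for a classifier; which model and labels G-SEPM actually uses is not stated in the abstract.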
Presentation: file