A Fault Tolerance Method Based on GPGPU

Abstract: To ensure the reliability of general-purpose computation on graphics hardware (GPGPU) programs running on CPU-GPU heterogeneous platforms, this paper proposes a fault-tolerant model implemented in software. After analyzing how transient faults arise during GPGPU program execution and the paths along which the resulting errors propagate, fault tolerance is designed separately for the CPU side and the GPU side on which a GPGPU program depends. Based on the execution characteristics of GPGPU programs, an optimization scheme is also designed that reduces the computational overhead of fault tolerance and improves the system's ability to work cooperatively. The model thus improves the reliability of GPGPU programs while reducing the extra overhead that the fault-tolerant design introduces. Tests on typical examples verify the feasibility and performance of the proposed scheme.