Abstract:
As the number of nodes used by high-performance applications increases, so does the probability that a failure occurs in one of the application's processes. Long-running applications therefore need fault tolerance to survive fail-stop failures. The parallel multigrid (MG) algorithm is widely used to solve the PDEs that arise in large-scale engineering and physical problems. To make the MG algorithm fault tolerant, we design a fault-tolerant algorithm, named FT-MG, based on fault-tolerant MPI. Experimental results demonstrate the fault tolerance of FT-MG and show that its fault-tolerance overhead is small.