Checkpointing in hybrid distributed systems

Jiannong Cao, Yifeng Chen, Kang Zhang, Yanxiang He

Research output: Chapter in book / Conference proceeding; Conference article published in proceeding or book; Academic research; peer-review

13 Citations (Scopus)

Abstract

Checkpointing and rollback recovery is a widely used technique for providing fault tolerance in computer systems subject to transient faults. Two primary checkpointing schemes have been proposed: independent and coordinated. However, most existing work considers applying only a single checkpointing and rollback recovery scheme to a target system. This paper discusses the issues involved in integrating independent and coordinated checkpointing schemes for applications running in a hybrid distributed environment containing multiple heterogeneous subsystems, and develops a new algorithm for doing so. It presents the changes required to the original checkpointing scheme of each subsystem and the unnecessary rollbacks that the integrated system prevents. It also describes an algorithm for collecting garbage checkpoints in the combined hybrid system.
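
As background for the two schemes named in the abstract, the sketch below shows the standard consistency condition for a set of checkpoints: no message may be recorded as received by one process while its send is missing from the sender's checkpoint (an "orphan" message). It is not the paper's algorithm. The names `Message`, `orphan_messages`, and `is_consistent` are hypothetical, and each time value is assumed to be a logical time on that process's own clock. Under independent checkpointing, a recovery procedure has to search for a set of checkpoints that passes a check like this, while coordinated checkpointing arranges for the checkpoints to be consistent when they are taken.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    sender: str
    send_time: int   # logical time on the sender's clock when the message was sent
    receiver: str
    recv_time: int   # logical time on the receiver's clock when it was delivered


def orphan_messages(cut, messages):
    """Return messages that are received inside the cut but sent outside it.

    `cut` maps each process name to the logical time of its chosen checkpoint.
    """
    return [
        m for m in messages
        if m.recv_time <= cut[m.receiver] and m.send_time > cut[m.sender]
    ]


def is_consistent(cut, messages):
    """A set of checkpoints is consistent when it contains no orphan messages."""
    return not orphan_messages(cut, messages)


# Illustrative data (assumed, not from the paper): P sends at its time 5,
# and Q delivers that message at its time 3.
log = [Message(sender="P", send_time=5, receiver="Q", recv_time=3)]

independent_cut = {"P": 4, "Q": 4}   # Q's checkpoint saw the receive, P's missed the send
rolled_back_cut = {"P": 4, "Q": 2}   # Q falls back to an earlier checkpoint
```

With this data, `independent_cut` contains an orphan message, so Q must roll back further; `rolled_back_cut` removes it. Repeated rollbacks of this kind are the cost that an integrated scheme aims to limit.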
Original language: English
Title of host publication: Proceedings of the International Symposium on Parallel Architectures, Algorithms and Networks, I-SPAN
Pages: 136-141
Number of pages: 6
Publication status: Published - 16 Aug 2004
Event: Proceedings on the International Symposium on Parallel Architectures, Algorithms and Networks, I-SPAN - Hong Kong, Hong Kong
Duration: 10 May 2004 - 12 May 2004

Conference

Conference: Proceedings on the International Symposium on Parallel Architectures, Algorithms and Networks, I-SPAN
Country: Hong Kong
City: Hong Kong
Period: 10/05/04 - 12/05/04

ASJC Scopus subject areas

  • Computer Science(all)
