Process migration for MPI applications based on coordinated checkpoint

Jiannong Cao, Yinghao Li, Minyi Guo

Research output: Chapter in book / Conference proceeding › Conference article published in proceeding or book › Academic research › peer-review

21 Citations (Scopus)

Abstract

A lot of research has been done on fault-tolerance for MPI applications, some on checkpoint/restart, and some on network fault-tolerance. Process migration, however, has not gained widespread use due to the additional complexity of the requirement that the knowledge about the new location of a migrated process has to be made known to every other process in the application. Here we present a simple yet effective method of process migration based on coordinated checkpointing of MPI applications. Migration is achieved by checkpointing the application, modifying the process location information in the checkpoint files, and restarting the application. Checkpoint/restart and migration are transparent to MPI applications. Performance evaluation results showed that the additional checkpoint/restart capability has little impact on application performance, and the migration method scales well on a large number of nodes.
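
As a rough illustration of the checkpoint, modify, restart sequence described in the abstract, the following Python sketch models each process's checkpoint as a small JSON file that also records where every rank runs. The file layout, the `locations` table, the host names, and all function names are hypothetical assumptions made for this example; the record does not describe the paper's actual checkpoint format, and no real MPI calls are involved.

```python
import json
import tempfile
from pathlib import Path


def write_checkpoint(directory: Path, rank: int, locations: dict, state: dict) -> Path:
    """Save one process's state plus its view of where every rank runs."""
    path = directory / f"rank{rank}.ckpt.json"
    payload = {"rank": rank, "locations": locations, "state": state}
    path.write_text(json.dumps(payload))
    return path


def migrate(directory: Path, rank: int, new_host: str) -> None:
    """Rewrite the recorded location of `rank` in every checkpoint file."""
    for path in sorted(directory.glob("rank*.ckpt.json")):
        payload = json.loads(path.read_text())
        # JSON object keys are strings, so ranks are stored as "0", "1", ...
        payload["locations"][str(rank)] = new_host
        path.write_text(json.dumps(payload))


def restart(directory: Path) -> dict:
    """Reload all checkpoints, keyed by rank, as a restart step might."""
    restored = {}
    for path in sorted(directory.glob("rank*.ckpt.json")):
        payload = json.loads(path.read_text())
        restored[payload["rank"]] = payload
    return restored


def demo() -> dict:
    """Checkpoint three ranks, move rank 1 to a new host, then restart."""
    locations = {"0": "nodeA", "1": "nodeB", "2": "nodeC"}  # hypothetical hosts
    with tempfile.TemporaryDirectory() as tmp:
        directory = Path(tmp)
        for rank in range(3):
            write_checkpoint(directory, rank, dict(locations), {"iteration": 42})
        migrate(directory, rank=1, new_host="nodeD")
        return restart(directory)
```

Because the location table is edited in every checkpoint file before restart, each restored process already holds the new location of the migrated rank, which is the point the abstract makes about avoiding a separate notification step.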
Original language: English
Title of host publication: Proceedings - 11th International Conference on Parallel and Distributed Systems Workshops, ICPADS 2005
Pages: 306-312
Number of pages: 7
Volume: 1
Publication status: Published - 1 Sept 2005
Event: 11th International Conference on Parallel and Distributed Systems Workshops, ICPADS 2005 - Fukuoka, Japan
Duration: 20 Jul 2005 to 22 Jul 2005

Conference

Conference: 11th International Conference on Parallel and Distributed Systems Workshops, ICPADS 2005
Country/Territory: Japan
City: Fukuoka
Period: 20/07/05 to 22/07/05

Keywords

  • Checkpoint/restart
  • Coordinated checkpoint
  • MPI
  • Process migration

ASJC Scopus subject areas

  • Hardware and Architecture
