Abstract
Much research has addressed fault tolerance for MPI applications, some of it on checkpoint/restart and some on network fault tolerance. Process migration, however, has not gained widespread use because of an added complexity: the new location of a migrated process must be made known to every other process in the application. Here we present a simple yet effective method of process migration based on coordinated checkpointing of MPI applications. Migration is achieved by checkpointing the application, modifying the process location information in the checkpoint files, and restarting the application. Checkpoint/restart and migration are transparent to MPI applications. Performance evaluation results showed that the additional checkpoint/restart capability has little impact on application performance, and that the migration method scales well on a large number of nodes.
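The three migration steps named in the abstract (checkpoint, rewrite the location information, restart) can be illustrated with a toy sketch. This is not the paper's implementation: the JSON file layout, the `rank<N>.json` file names, and the functions `write_checkpoints`, `migrate`, `restart`, and `demo` are all hypothetical, and no real MPI library or checkpoint format is involved. It only shows why editing the location table in every checkpoint file lets all processes learn the new location on restart.

```python
import json
import tempfile
from pathlib import Path


def write_checkpoints(ckpt_dir, hosts):
    """Stand-in for a coordinated checkpoint: each rank saves the full host table.

    Hypothetical layout: hosts[i] is the node that rank i runs on.
    """
    for rank in range(len(hosts)):
        path = Path(ckpt_dir) / f"rank{rank}.json"
        path.write_text(json.dumps({"rank": rank, "hosts": hosts}))


def migrate(ckpt_dir, rank, new_host):
    """Rewrite the location of `rank` in every checkpoint file."""
    for path in sorted(Path(ckpt_dir).glob("rank*.json")):
        state = json.loads(path.read_text())
        state["hosts"][rank] = new_host
        path.write_text(json.dumps(state))


def restart(ckpt_dir):
    """Stand-in for restart: return each rank's view of where its peers run."""
    views = {}
    for path in sorted(Path(ckpt_dir).glob("rank*.json")):
        state = json.loads(path.read_text())
        views[state["rank"]] = state["hosts"]
    return views


def demo():
    """Checkpoint three ranks, move rank 1 to a new node, then restart."""
    with tempfile.TemporaryDirectory() as ckpt_dir:
        write_checkpoints(ckpt_dir, ["nodeA", "nodeB", "nodeC"])
        migrate(ckpt_dir, rank=1, new_host="nodeD")
        return restart(ckpt_dir)


if __name__ == "__main__":
    for rank, hosts in demo().items():
        print(rank, hosts)
```

In this sketch every rank sees the same updated host table after restart, so no rank needs to be notified of the migration while the application is running.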
Original language | English |
---|---|
Title of host publication | Proceedings - 11th International Conference on Parallel and Distributed Systems Workshops, ICPADS 2005 |
Pages | 306-312 |
Number of pages | 7 |
Volume | 1 |
Publication status | Published - 1 Sept 2005 |
Event | 11th International Conference on Parallel and Distributed Systems Workshops, ICPADS 2005 - Fukuoka, Japan Duration: 20 Jul 2005 → 22 Jul 2005 |
Conference
Conference | 11th International Conference on Parallel and Distributed Systems Workshops, ICPADS 2005 |
---|---|
Country/Territory | Japan |
City | Fukuoka |
Period | 20/07/05 → 22/07/05 |
Keywords
- Checkpoint/restart
- Coordinated checkpoint
- MPI
- Process migration
ASJC Scopus subject areas
- Hardware and Architecture