DMTCP (Distributed MultiThreaded Checkpointing) is a checkpointing package for applications. Checkpointing allows a simulation to be resumed after it fails because of resource problems (e.g. hardware or software failures, or exceeded time and memory limits).
DMTCP supports both sequential and multi-threaded applications. Examples of programs on Linux distributions that can be used with DMTCP include OpenMP, MATLAB, Python, Perl, MySQL, bash, gdb, and X-Windows.
DMTCP provides support for several resource managers, including SLURM, the resource manager used at HCC. The DMTCP module is available on both Tusker and Crane, and is enabled by typing:
module load dmtcp/2.5
After the module is loaded, the first step is to start the application under DMTCP with dmtcp_launch. The --rm option enables SLURM support, --interval <interval_time_seconds> sets the time in seconds between automatic checkpoints, and <your_command> is the actual command you want to run and checkpoint.
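Put together, the launch step described above can be sketched as a small shell wrapper. The function name launch_with_checkpoints, the program ./my_simulation, and the file input.dat are hypothetical names used only for illustration; the --rm and --interval options are the ones described in the text. As a convenience that is an assumption of this sketch, the wrapper prints the command instead of running it when dmtcp_launch is not on the PATH (for example, before the module is loaded).

```shell
# Launch a command under DMTCP with periodic checkpoints.
# Usage: launch_with_checkpoints <interval_seconds> <command> [args...]
launch_with_checkpoints() {
    interval="$1"
    shift
    if command -v dmtcp_launch >/dev/null 2>&1; then
        # --rm enables SLURM support; --interval sets the checkpoint period.
        dmtcp_launch --rm --interval "$interval" "$@"
    else
        # DMTCP is not loaded: show what would have been run.
        echo "dmtcp_launch --rm --interval $interval $*"
    fi
}

# Hypothetical program and input file, checkpointed every 30 minutes.
launch_with_checkpoints 1800 ./my_simulation input.dat
```

Printing the command first is a cheap way to check the interval and arguments before the job is submitted.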
Besides the general options described above, more dmtcp_launch options can be listed with dmtcp_launch --help.
dmtcp_launch creates a few files that are used to resume the cancelled job, such as ckpt_*.dmtcp and dmtcp_restart_script*.sh. Unless a different location is given with the --ckptdir option, these files are stored in the current working directory.
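To confirm that checkpoints are actually being written, the files named above can be listed from the shell. The following is a minimal sketch; the function name list_checkpoint_files and the CKPT_DIR variable are hypothetical, and the directory should be whatever was passed to --ckptdir, or the launch directory if that option was not used.

```shell
# List the checkpoint images and restart scripts found in a directory.
# Usage: list_checkpoint_files [directory]   (defaults to the current one)
list_checkpoint_files() {
    dir="${1:-.}"
    # ls complains about a pattern that matches nothing; hide that message.
    ls -1 "$dir"/ckpt_*.dmtcp "$dir"/dmtcp_restart_script*.sh 2>/dev/null
}

# CKPT_DIR is an assumed variable holding the --ckptdir value, if any.
list_checkpoint_files "${CKPT_DIR:-.}"
```

An empty listing after the first checkpoint interval has passed suggests the job was not started through dmtcp_launch or is writing to a different directory.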
The second step of DMTCP is to restart the cancelled job, and there are two ways of doing that:
- dmtcp_restart ckpt_*.dmtcp <options> (before running this command, delete any old ckpt_*.dmtcp files in your current directory)
- ./dmtcp_restart_script.sh <options>
If there are no options defined in the <options> field, DMTCP will keep running with the options defined in the initial dmtcp_launch call (such as interval time, output directory etc).
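The choice between the two restart methods can be sketched as a small helper that inspects the checkpoint directory and prints the matching command. The function name pick_restart_command is hypothetical, and preferring the restart script over the checkpoint images is a choice made for this sketch, not something DMTCP requires.

```shell
# Print the restart command that fits the files found in a directory.
# Usage: pick_restart_command [directory]   (defaults to the current one)
pick_restart_command() {
    dir="${1:-.}"
    if [ -x "$dir/dmtcp_restart_script.sh" ]; then
        # The generated script is present and executable: use it.
        echo "./dmtcp_restart_script.sh"
    elif ls "$dir"/ckpt_*.dmtcp >/dev/null 2>&1; then
        # Only checkpoint images are present: restart from them directly.
        echo "dmtcp_restart ckpt_*.dmtcp"
    else
        echo "no checkpoint files found in $dir" >&2
        return 1
    fi
}

pick_restart_command . || true
```

The helper only prints the command; run the printed line from inside the checkpoint directory, appending any <options> you want to change.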
A simple example of using DMTCP with BLAST on Tusker is shown below:
In this example, DMTCP takes checkpoints every hour (--interval 3600), and the actual command we want to checkpoint is blastx with some general BLAST options defined with -query, -db, -out, -num_threads.
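A submit file matching this description might look like the sketch below. Only the dmtcp_launch line, the dmtcp/2.5 module, and the blastx option names come from the text; the job name, the resource requests, the blast module name, the $WORK/blast_project directory, and the input, database, and output file names are placeholders chosen for illustration. The snippet writes the file with a here-document so it can be inspected before being passed to sbatch.

```shell
# Write a SLURM submit file (hypothetical name: blast_dmtcp.submit).
cat > blast_dmtcp.submit <<'EOF'
#!/bin/sh
#SBATCH --job-name=BlastDMTCP
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --time=24:00:00
#SBATCH --mem=10gb
#SBATCH --output=BlastDMTCP.%J.out
#SBATCH --error=BlastDMTCP.%J.err

module load dmtcp/2.5
module load blast   # assumed module name; check 'module avail'

# Assumed project directory; checkpoint files are written here.
cd "$WORK/blast_project"

# Checkpoint every hour (3600 s) while blastx runs.
dmtcp_launch --rm --interval 3600 blastx \
    -query input_reads.fasta \
    -db "$WORK/blast_db/nt" \
    -out blastx_output.alignments \
    -num_threads "$SLURM_NTASKS_PER_NODE"
EOF

echo "wrote blast_dmtcp.submit; submit it with: sbatch blast_dmtcp.submit"
```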
If this job is killed for any reason, it can be restarted using the following submit file:
dmtcp_restart generates new ckpt_*.dmtcp and dmtcp_restart_script*.sh files. Therefore, if the restarted job is also killed because resources are unavailable or exceeded, you can resubmit the same job again without any changes to the submit file shown above. If you restart from the ckpt_*.dmtcp files instead of dmtcp_restart_script.sh, remember to delete the old ckpt_*.dmtcp files first.
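A restart submit file of the kind described here could be sketched as follows. It uses the dmtcp_restart_script.sh route, so no old checkpoint files need to be deleted between resubmissions. The job name, resource requests, and the $WORK/blast_project directory are assumptions made for illustration; the directory must be the one that holds the checkpoint files from the original run. As before, the snippet writes the file with a here-document.

```shell
# Write a SLURM restart submit file (hypothetical name: blast_dmtcp_restart.submit).
cat > blast_dmtcp_restart.submit <<'EOF'
#!/bin/sh
#SBATCH --job-name=BlastRestart
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --time=24:00:00
#SBATCH --mem=10gb
#SBATCH --output=BlastRestart.%J.out
#SBATCH --error=BlastRestart.%J.err

module load dmtcp/2.5

# Assumed directory containing the checkpoint files and restart script.
cd "$WORK/blast_project"

# Resume from the latest checkpoint with the original launch options.
./dmtcp_restart_script.sh
EOF

echo "wrote blast_dmtcp_restart.submit; submit it with: sbatch blast_dmtcp_restart.submit"
```

Because each restart writes fresh checkpoint files and a fresh restart script, the same file can be submitted repeatedly until the job completes.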
Even though DMTCP tries to support most mainstream and commonly used applications, there is no guarantee that every application can be checkpointed and restarted.