Running on Cloud / HPC
Run backmapping simulations on cloud spot instances or HPC clusters with checkpoint/restart support for fault tolerance.
Enabling Restart Checkpoints
Add restart_interval to your settings.yaml:
simulation:
  restart_interval: 5000  # write a checkpoint every 5000 steps
  alpha: 0.001
  # ... other settings
When set, backmap-prep generates:
| File | Purpose |
|---|---|
| in.backmap | Master script (includes restart commands) |
| in.backmap.setup | Shared style/coefficient definitions |
| in.backmap.phase1 | First segment: CG equilibration if equilibration_steps > 0, otherwise backmapping (λ 0→1) |
| in.backmap.phase2 | Backmapping ramp if equilibration_steps > 0; optional AT production if equilibration_steps == 0 and production_steps > 0 |
| in.backmap.phase3 | Optional AT production; generated only if equilibration_steps > 0 and production_steps > 0 |
With equilibration_steps: 0 and production_steps: 0 (the default), phase1 is
the only per-phase script generated (backmapping only; it writes *_hybrid.data).
Without restart_interval, only the master in.backmap is generated, so existing
workflows stay backward compatible.
How checkpointing works
- LAMMPS writes alternating restart files (restart.backmap, restart.backmap2) every N steps; if one is corrupted mid-write, the other is still valid.
- At each phase boundary, a clean write_restart and a sentinel file (phase_1.done, phase_2.done, phase_3.done) are written.
- On restart, the entrypoint script checks sentinels to determine which phase to resume.
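The alternating-file scheme can be illustrated with a short shell sketch. It is not taken from the project's scripts: restart_candidates is a hypothetical helper that lists the usable restart files newest first, so a caller can try the most recent one and fall back to the older one if the newest was truncated mid-write. It assumes both files live in the same directory and treats an empty file as unusable.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of run-backmap.sh): list restart files,
# newest first, so the caller can fall back if the newest one is corrupt.
# Usage: restart_candidates [directory]
restart_candidates() {
    local dir="${1:-.}"
    local a="$dir/restart.backmap"
    local b="$dir/restart.backmap2"

    # -s is true only for files that exist and are non-empty.
    if [ -s "$a" ] && [ -s "$b" ]; then
        # -nt compares modification times ("newer than").
        if [ "$b" -nt "$a" ]; then
            printf '%s\n%s\n' "$b" "$a"
        else
            printf '%s\n%s\n' "$a" "$b"
        fi
    elif [ -s "$a" ]; then
        printf '%s\n' "$a"
    elif [ -s "$b" ]; then
        printf '%s\n' "$b"
    else
        return 1   # no usable restart file: treat as a fresh start
    fi
}
```

A non-empty file can still be incomplete, which is why the sketch returns both candidates rather than only the newest one.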
Entrypoint Script
The run-backmap.sh script handles restart logic automatically:
# Fresh start
run-backmap.sh -np 4 -in in.backmap
# After preemption — detects restart files and resumes
run-backmap.sh -np 4 -in in.backmap
The script:
- Checks for restart.backmap / restart.backmap2.
- If no restart file exists: runs in.backmap (fresh start).
- If a restart file exists: reads the sentinel files to determine the phase, then runs the appropriate in.backmap.phaseN.
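The decision logic described above could look roughly like the following. This is an illustrative sketch, not the shipped run-backmap.sh: select_input is a hypothetical function, and it assumes that the first phase without a sentinel is the one to resume and that a missing phase script means that phase was never generated.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the resume decision; not the shipped run-backmap.sh.
# Prints the name of the input file the next run should use.
# Usage: select_input [directory]
select_input() {
    local dir="${1:-.}"
    local phase

    # No restart file at all: nothing to resume, so start from the master script.
    if [ ! -e "$dir/restart.backmap" ] && [ ! -e "$dir/restart.backmap2" ]; then
        printf 'in.backmap\n'
        return 0
    fi

    # The first phase without a sentinel is the one to resume.
    for phase in 1 2 3; do
        [ -e "$dir/phase_${phase}.done" ] && continue
        # A missing phase script means that phase was never generated.
        [ -f "$dir/in.backmap.phase${phase}" ] || return 1
        printf 'in.backmap.phase%s\n' "$phase"
        return 0
    done

    # Every phase has a sentinel: the run is already complete.
    return 1
}
```

A non-zero return value here means there is nothing left to run, which a wrapper can treat as success rather than as an error.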
Google Cloud Batch (Spot VMs)
Cloud Batch runs containerised jobs on spot (preemptible) VMs with automatic retry on preemption.
Quick start
# 1. Build and push image
docker build -t lammps-backmap .
docker tag lammps-backmap REGION-docker.pkg.dev/PROJECT/REPO/lammps-backmap:latest
docker push REGION-docker.pkg.dev/PROJECT/REPO/lammps-backmap:latest
# 2. Upload simulation data to GCS
gsutil -m cp *.data in.* *.table gs://BUCKET/sim-data/
# 3. Submit job
gcloud batch jobs submit my-backmap-job \
  --location=us-central1 \
  --config=examples/cloud-batch/job.json
See examples/cloud-batch/ for the full job template and setup instructions.
Key settings in job.json
| Setting | Value | Purpose |
|---|---|---|
| provisioningModel | SPOT | Use preemptible VMs (up to 90% cheaper) |
| maxRetryCount | 3 | Retry on preemption |
| maxRunDuration | 14400s | 4-hour timeout per attempt |
| GCS volume mount | /work | Persistent storage survives preemption |
Traditional HPC (Singularity / Apptainer)
For clusters without Docker support, convert to a SIF image:
docker save lammps-backmap -o lammps-backmap.tar
apptainer build lammps-backmap.sif docker-archive://lammps-backmap.tar
Running with restart on HPC
Copy scripts/run-backmap.sh to your working directory along with the
generated input files, then use it in your job script:
#!/bin/bash
#SBATCH --job-name=backmap
#SBATCH --ntasks=64
#SBATCH --time=04:00:00
#SBATCH --requeue
srun apptainer exec lammps-backmap.sif \
  run-backmap.sh -np 64 -in in.backmap
With --requeue, Slurm resubmits the job if it's preempted. On restart,
run-backmap.sh detects the checkpoint and resumes from the correct phase.
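When debugging requeued jobs, it can help to log whether a given run is the first attempt. The sketch below is an optional addition to a job script, not something run-backmap.sh is known to do: attempt_label is a hypothetical helper that reads SLURM_RESTART_COUNT, which Slurm sets once a job has been restarted or requeued.

```shell
#!/usr/bin/env bash
# Hypothetical helper for a job script: report whether this run is a requeue.
attempt_label() {
    # SLURM_RESTART_COUNT is unset on the first run, so default it to 0.
    local count="${SLURM_RESTART_COUNT:-0}"
    if [ "$count" -gt 0 ]; then
        printf 'requeued run (restart #%s)\n' "$count"
    else
        printf 'first run\n'
    fi
}
```

Calling attempt_label just before the srun line puts the attempt number in the job's output file, next to the LAMMPS log for that attempt.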
MPI compatibility
For multi-node runs, the host MPI must be ABI-compatible with the container's OpenMPI. See the Docker page for binding instructions.