Running on Cloud / HPC

Run backmapping simulations on cloud spot instances or HPC clusters with checkpoint/restart support for fault tolerance.

Enabling Restart Checkpoints

Add restart_interval to your settings.yaml:

simulation:
  restart_interval: 5000   # write checkpoint every 5000 steps
  alpha: 0.001
  # ... other settings

When set, backmap-prep generates:

| File | Purpose |
| --- | --- |
| `in.backmap` | Master script (includes restart commands) |
| `in.backmap.setup` | Shared style/coefficient definitions |
| `in.backmap.phase1` | First segment: CG equilibration if `equilibration_steps > 0`, otherwise the backmapping ramp (λ 0→1) |
| `in.backmap.phase2` | Generated if `equilibration_steps > 0` (backmapping ramp), or if `equilibration_steps == 0` and `production_steps > 0` (optional AT production) |
| `in.backmap.phase3` | Generated only if `equilibration_steps > 0` and `production_steps > 0`: optional AT production |

With the default values (equilibration_steps: 0 and production_steps: 0), the only per-phase file generated is in.backmap.phase1, which runs the backmapping ramp and writes *_hybrid.data.

Without restart_interval, only the master in.backmap is generated (backward compatible).
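The phase-file rules in the table above reduce to "one file per enabled stage, numbered in order". The following is a minimal sketch of that rule, not backmap-prep's own logic; list_phases is a hypothetical helper introduced only for illustration.

```shell
#!/usr/bin/env bash
# Sketch: list the per-phase input files implied by the table above.
# list_phases is a hypothetical helper, not part of backmap-prep.
list_phases() {
    local eq_steps=$1 prod_steps=$2
    local n=1
    # Optional CG equilibration takes the first slot when enabled.
    if [ "$eq_steps" -gt 0 ]; then
        echo "in.backmap.phase$n: CG equilibration"
        n=$((n + 1))
    fi
    # The backmapping ramp is always present.
    echo "in.backmap.phase$n: backmapping ramp"
    n=$((n + 1))
    # Optional AT production takes the last slot when enabled.
    if [ "$prod_steps" -gt 0 ]; then
        echo "in.backmap.phase$n: AT production"
    fi
}

list_phases 0 0        # defaults: backmapping only
list_phases 1000 5000  # equilibration, ramp, and production
```

The step counts in the two calls are arbitrary; only whether each value is zero affects the result.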

How checkpointing works

- LAMMPS writes alternating restart files (restart.backmap, restart.backmap2) every N steps, where N is restart_interval. If one file is corrupted mid-write, the other is still valid.
- At each phase boundary, a clean write_restart is performed and a sentinel file (phase_1.done, phase_2.done, phase_3.done) is written.
- On restart, the entrypoint script checks the sentinel files to determine which phase to resume.
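Because the two restart files alternate, a resume step has to decide which one to try first. The sketch below orders the candidates newest first, so a caller can fall back to the older file if LAMMPS rejects the newer one. It is an illustration under assumptions: restart_candidates is a hypothetical helper, and how run-backmap.sh actually picks a file is not documented here.

```shell
#!/usr/bin/env bash
# Sketch: print the alternating restart files that exist and are non-empty,
# newest first. restart_candidates is a hypothetical helper; the real
# run-backmap.sh may choose differently.
restart_candidates() {
    local dir=${1:-.}
    local a="$dir/restart.backmap" b="$dir/restart.backmap2"
    # -s: file exists and is non-empty; -nt: newer modification time.
    if [ -s "$a" ] && [ -s "$b" ]; then
        if [ "$b" -nt "$a" ]; then
            printf '%s\n' "$b" "$a"
        else
            printf '%s\n' "$a" "$b"
        fi
    elif [ -s "$a" ]; then
        printf '%s\n' "$a"
    elif [ -s "$b" ]; then
        printf '%s\n' "$b"
    else
        return 1   # nothing to resume from
    fi
}
```

A non-empty file can still be truncated, so the size check is only a first filter; the ordering matters because the newest file is the one most likely to have been interrupted mid-write.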

Entrypoint Script

The run-backmap.sh script handles restart logic automatically:

# Fresh start
run-backmap.sh -np 4 -in in.backmap

# After preemption: same command; detects restart files and resumes
run-backmap.sh -np 4 -in in.backmap

The script:

1. Checks for restart.backmap / restart.backmap2.
2. If no restart file exists: runs in.backmap (fresh start).
3. If a restart file exists: reads the sentinel files to determine the phase, then runs the appropriate in.backmap.phaseN.
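The three steps above can be sketched as a small dispatch function. This is an illustration of the described behaviour, not the contents of run-backmap.sh: select_input is a hypothetical name, and the rule "resume the first generated phase that has no sentinel" is an assumption inferred from the description.

```shell
#!/usr/bin/env bash
# Sketch: decide which input file to run, based on restart and sentinel files.
# select_input is a hypothetical helper; run-backmap.sh may differ in detail.
select_input() {
    local dir=${1:-.}
    local n
    # No restart file at all: fresh start with the master script.
    if [ ! -e "$dir/restart.backmap" ] && [ ! -e "$dir/restart.backmap2" ]; then
        echo "in.backmap"
        return 0
    fi
    # Otherwise resume the first generated phase without a sentinel.
    for n in 1 2 3; do
        [ -e "$dir/in.backmap.phase$n" ] || break   # no further phases exist
        if [ ! -e "$dir/phase_$n.done" ]; then
            echo "in.backmap.phase$n"
            return 0
        fi
    done
    # Every generated phase has a sentinel: nothing left to run.
    echo "done"
}
```

For example, a working directory containing restart.backmap, phase_1.done, and in.backmap.phase2 would yield in.backmap.phase2.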

Google Cloud Batch (Spot VMs)

Cloud Batch runs containerised jobs on spot (preemptible) VMs with automatic retry on preemption.

Quick start

# 1. Build and push image
docker build -t lammps-backmap .
docker tag lammps-backmap REGION-docker.pkg.dev/PROJECT/REPO/lammps-backmap:latest
docker push REGION-docker.pkg.dev/PROJECT/REPO/lammps-backmap:latest

# 2. Upload simulation data to GCS
gsutil -m cp *.data in.* *.table gs://BUCKET/sim-data/

# 3. Submit job
gcloud batch jobs submit my-backmap-job \
    --location=us-central1 \
    --config=examples/cloud-batch/job.json

See examples/cloud-batch/ for the full job template and setup instructions.

Key settings in `job.json`

| Setting | Value | Purpose |
| --- | --- | --- |
| `provisioningModel` | `SPOT` | Use preemptible VMs (up to 90% cheaper) |
| `maxRetryCount` | `3` | Retry on preemption |
| `maxRunDuration` | `14400s` | 4-hour timeout per attempt |
| GCS volume mount | `/work` | Persistent storage survives preemption |
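To show where these settings sit in a Cloud Batch job definition, the sketch below writes a minimal config to job-sketch.json. It is not the template shipped in examples/cloud-batch/. The image URI and bucket path reuse the placeholders from the quick start, and the machine type, the entrypoint being on the image's PATH, the -np value, and the output filename are all assumptions.

```shell
#!/usr/bin/env bash
# Sketch: minimal Cloud Batch job config carrying the four settings above.
# Placeholders/assumptions: REGION, PROJECT, REPO, BUCKET, the machine type,
# and run-backmap.sh being on PATH inside the image.
cat > job-sketch.json <<'EOF'
{
  "taskGroups": [
    {
      "taskCount": 1,
      "taskSpec": {
        "runnables": [
          {
            "container": {
              "imageUri": "REGION-docker.pkg.dev/PROJECT/REPO/lammps-backmap:latest",
              "entrypoint": "run-backmap.sh",
              "commands": ["-np", "4", "-in", "in.backmap"],
              "options": "--workdir /work"
            }
          }
        ],
        "volumes": [
          { "gcs": { "remotePath": "BUCKET/sim-data" }, "mountPath": "/work" }
        ],
        "maxRetryCount": 3,
        "maxRunDuration": "14400s"
      }
    }
  ],
  "allocationPolicy": {
    "instances": [
      { "policy": { "machineType": "c2-standard-4", "provisioningModel": "SPOT" } }
    ]
  },
  "logsPolicy": { "destination": "CLOUD_LOGGING" }
}
EOF
```

Running the container with /work as its working directory means restart and sentinel files are written to the mounted bucket, which is what lets a retried attempt find them.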

Traditional HPC (Singularity / Apptainer)

For clusters without Docker support, convert the Docker image to a SIF image:

docker save lammps-backmap -o lammps-backmap.tar
apptainer build lammps-backmap.sif docker-archive://lammps-backmap.tar

Running with restart on HPC

Copy scripts/run-backmap.sh to your working directory along with the generated input files, then use it in your job script:

#!/bin/bash
#SBATCH --job-name=backmap
#SBATCH --ntasks=64
#SBATCH --time=04:00:00
#SBATCH --requeue

srun apptainer exec lammps-backmap.sif \
    run-backmap.sh -np 64 -in in.backmap

With --requeue, the job is eligible to be put back in the queue if it is preempted or a node fails. When the requeued job starts, run-backmap.sh detects the checkpoint and resumes from the correct phase.
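When reading job logs, it helps to know whether a given run was a first attempt or a requeue. Slurm sets SLURM_RESTART_COUNT for jobs that have been restarted or requeued, so a job script could log it before launching the simulation. The snippet is a sketch; attempt_label is a hypothetical helper and is not part of run-backmap.sh.

```shell
#!/usr/bin/env bash
# Sketch: label the current attempt from Slurm's restart counter.
# attempt_label is a hypothetical helper for logging only.
attempt_label() {
    # SLURM_RESTART_COUNT is unset on the first run of a job.
    local restarts=${SLURM_RESTART_COUNT:-0}
    if [ "$restarts" -eq 0 ]; then
        echo "fresh start"
    else
        echo "restart #$restarts"
    fi
}

echo "backmap job: $(attempt_label)"
```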

MPI compatibility

For multi-node runs, the host MPI must be ABI-compatible with the container's OpenMPI. See the Docker page for binding instructions.