Metadata-Version: 2.4
Name: torch_atomic_save
Version: 0.1.0
Summary: Atomic, asynchronous checkpoint saving for PyTorch in Slurm/Lustre environments
Author-email: Patrick Carnahan <pcarnah@uwo.ca>
License: MIT
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=2.0.0
Requires-Dist: numpy
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Dynamic: license-file

# torch-atomic-save

[![Pytest](https://github.com/pcarnah/torch-atomic-save/actions/workflows/test.yml/badge.svg)](https://github.com/pcarnah/torch-atomic-save/actions)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

An asynchronous, atomic checkpointing utility for PyTorch, optimized for Slurm and Lustre/NFS environments.

## Key Features
* **Atomic Moves:** Moves each finished checkpoint into place atomically, so a Slurm preemption mid-save cannot leave a corrupted file at the target path.
* **Non-Blocking:** Offloads I/O to a background thread pool.
* **Race Condition Protection:** Automatically clones tensors to CPU before background saving.
* **Cross-FS Support:** Handles moves between local SSD and network storage safely.
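
The general pattern behind these features can be sketched with the standard library alone. This is an illustration under stated assumptions, not this package's implementation: `atomic_write` is a hypothetical helper, raw bytes stand in for a serialized checkpoint, and an immutable `bytes` copy stands in for cloning tensors to CPU. In a PyTorch setting the payload would typically be produced by `torch.save` from CPU copies of the tensors.

```python
import os
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def atomic_write(data: bytes, final_path: str, tmp_dir: str) -> str:
    """Hypothetical helper: readers see the old file or the complete new one."""
    final = Path(final_path)
    final.parent.mkdir(parents=True, exist_ok=True)
    Path(tmp_dir).mkdir(parents=True, exist_ok=True)

    # 1. Write the full payload to a scratch file (e.g. on a local SSD).
    fd, scratch = tempfile.mkstemp(dir=tmp_dir, suffix=".tmp")
    with os.fdopen(fd, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())

    # 2. A rename is atomic only within one filesystem, so stage a copy
    #    next to the destination first. (This sketch does not fsync the
    #    staged copy or the parent directory.)
    fd, staged = tempfile.mkstemp(dir=final.parent, suffix=".partial")
    os.close(fd)
    try:
        shutil.copyfile(scratch, staged)
        # 3. Atomically swap the staged copy into place.
        os.replace(staged, final)
    finally:
        os.remove(scratch)
        if os.path.exists(staged):
            os.remove(staged)  # only reached if the copy or rename failed
    return str(final)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        live = bytearray(b"weights-v1")  # stands in for mutable model state
        snapshot = bytes(live)           # copy taken before handing off
        with ThreadPoolExecutor(max_workers=2) as pool:
            future = pool.submit(
                atomic_write,
                snapshot,
                os.path.join(root, "ckpt", "model.bin"),
                os.path.join(root, "scratch"),
            )
            live[:] = b"weights-v2"      # training keeps mutating state
            saved = future.result()      # block only when the result is needed
        print(Path(saved).read_bytes())  # the snapshot, not the mutated state
```

Taking the snapshot before submitting the job is what keeps the background writer from seeing state that the training loop is still modifying, and staging the copy beside the destination is what lets the final rename stay atomic even when the scratch directory lives on a different filesystem.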

## Installation
```bash
pip install torch-atomic-save
```


## Usage
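```python
import torch
from torch_atomic_save import SlurmAtomicManager

model = torch.nn.Linear(10, 2)     # placeholder; use your own model
num_epochs, save_interval = 10, 2  # placeholder values

# Initialize the manager with up to 4 background I/O workers
manager = SlurmAtomicManager(max_workers=4)

for epoch in range(num_epochs):
    ...  # your training step goes here
    # Queue a background checkpoint save every `save_interval` epochs
    if epoch % save_interval == 0:
        manager.save(model, "path/to/checkpoints/model.pt", tmp_dir="/scratch/user/tmp")

# Ensure all I/O is finished before exiting
manager.wait_for_all()
```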

## Attribution
If you use this software in your research, please cite it using the "Cite this repository" button or the provided CITATION.cff.
