Metadata-Version: 2.4
Name: augment-atoms
Version: 0.2.0
Summary: Augment datasets of atomic configurations via a model-driven rattle-relax-repeat procedure
License-File: LICENSE
Requires-Python: >=3.9
Requires-Dist: ase>=3.25.0
Requires-Dist: dacite>=1.9.2
Requires-Dist: data2objects>=0.1.0
Requires-Dist: load-atoms>=0.3.9
Requires-Dist: vesin>=0.3.3
Description-Content-Type: text/markdown

# `augment-atoms`

<div align="center">

[![Test](https://github.com/jla-gardner/augment-atoms/actions/workflows/test.yaml/badge.svg)](https://github.com/jla-gardner/augment-atoms/actions/workflows/test.yaml)
[![PyPI](https://img.shields.io/pypi/v/augment-atoms)](https://pypi.org/project/augment-atoms/)
[![GitHub last commit](https://img.shields.io/github/last-commit/jla-gardner/augment-atoms)](https://github.com/jla-gardner/augment-atoms/commits/main)
[![License](https://img.shields.io/badge/license-MIT-blue)](LICENSE)

</div>

`augment-atoms` is a tool for augmenting datasets of atomic configurations via a model-driven, GPU-accelerated, rattle-relax-repeat procedure.

For each structure in the starting dataset, `augment-atoms` uses the provided potential energy surface (PES) model to generate a "family tree" of new structures.
Initially, the tree consists of only the starting structure.
To generate a new "child" structure, `augment-atoms`:
1. selects a "parent" structure from the tree,
2. rattles the atomic positions and unit cell,
3. relaxes using the PES model to get a new structure,
4. labels the child structure with the PES model, and
5. inserts the child structure into the tree.

For precise details of each of these steps, see the [Details](#details) section below.


## Installation

```bash
pip install augment-atoms
```

This installs the `augment-atoms` command line tool. It requires Python 3.9+; see `pyproject.toml` for the full list of dependencies. Using [`uv`](https://docs.astral.sh/uv/) is recommended: it installs `augment-atoms` and its dependencies from scratch in under 20 seconds.

There are no specific hardware requirements for `augment-atoms`. If a GPU is available, and the PES model supports it, the GPU will be used to accelerate structure generation. `augment-atoms` has been tested on both Linux and macOS.

## Usage

```bash
augment-atoms config.yaml
```

where `config.yaml` is a YAML file containing the following:

```yaml
data:
  # an ase-readable file containing the starting structures
  input: input.xyz 

  # an ase-writeable path to append the new structures to
  output: output.xyz

config:
  # number of augmentations per starting structure
  n_per_structure: 10
  
  # the temperature
  T: 300  # units are Kelvin

  # the explore-vs-exploit trade-off (see below)
  beta: 0.5

  # the range of values from which to sample a 
  # standard deviation to rattle with at each step
  sigma_range: [0.01, 0.1]  # units are Å

  # the random seed to use (for reproducibility)
  seed: 42

  # the standard deviation of the cell perturbation
  # if null, no cell perturbation is applied
  cell_sigma: null  # units are Å
  
  # the units of the energies generated by the PES model
  units: eV

  # the maximum force magnitude to relax to
  max_force: 30  # units are (energy / Å)

  # the minimum allowed separation between any pair of atoms
  # (structures with closer pairs are rejected)
  min_separation: 0.5  # units are Å

  # the maximum number of relaxation steps to perform per iteration
  max_relax_steps: 20

  # the threshold for considering a structure too similar to the existing pool
  similarity_threshold: 0.1  # units are Å

model:
  # the calculator to use to generate the PES model
  calculator: +lennard_jones()
```

In-built options for the `calculator` are:

- a Lennard-Jones calculator:
```yaml
model:
  calculator: +lennard_jones()
```
- any model from the [graph-pes](https://github.com/jla-gardner/graph-pes) package. If a GPU is available, it will be used to accelerate the PES model.
```yaml
model:
  calculator:
    +graph_pes_calculator:
      path: path/to/model.pt
```

Alternatively, you are free to point to any instance of an [ase.Calculator](https://wiki.fysik.dtu.dk/ase/ase/calculators/calculator.html) object.
If you have `my_function` in `my_file.py` that returns an `ase.Calculator` object, you can use it as follows:
```yaml
model:
  calculator: +my_file.my_function()
```

## Details

### 1. Selecting a parent structure

To choose a new parent structure, we randomly sample from all structures in the tree, such that structure $i$ has a probability of being picked given by

$$\mathbb{P}_i = \beta \cdot \frac{e^{-E_i / kT}}{\sum_j e^{-E_j / kT}} + (1-\beta) \cdot \frac{G_i}{\sum_j G_j}$$

where $E_i$ is the energy of structure $i$, $G_i \in \mathbb{Z}^+$ is the "generation" of the structure, $k$ is the Boltzmann constant, $T$ is the temperature, and $\beta \in [0, 1]$.
Small values of $\beta$ favour the sampling of "younger" structures in the family tree, and hence a greater degree of exploration.
Large values of $\beta$ favour the sampling of lower energy structures, and hence a denser sampling of the PES around energy minima.
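
As a minimal sketch of this sampling rule (not the package's actual implementation), the snippet below computes $\mathbb{P}_i$ for a toy tree using only the Python standard library. The function name `selection_probabilities`, the toy energies, and the convention that the seed structure has generation 1 are assumptions made for illustration.

```python
import math
import random

K_B = 8.617333262e-5  # Boltzmann constant in eV/K


def selection_probabilities(energies, generations, T, beta):
    """P_i = beta * Boltzmann weight + (1 - beta) * generation weight."""
    kT = K_B * T
    # Shifting by the minimum energy avoids overflow in exp();
    # the shift cancels in the normalised ratio.
    e_min = min(energies)
    weights = [math.exp(-(e - e_min) / kT) for e in energies]
    z = sum(weights)
    g_total = sum(generations)
    return [
        beta * w / z + (1 - beta) * g / g_total
        for w, g in zip(weights, generations)
    ]


# A toy tree: the seed (assumed generation 1) and two descendants.
energies = [-10.00, -9.98, -9.95]  # eV
generations = [1, 2, 3]
probs = selection_probabilities(energies, generations, T=300, beta=0.5)

# Draw one parent index according to these probabilities.
rng = random.Random(42)
parent_index = rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Setting `beta=0` reduces this to sampling in proportion to generation alone, while `beta=1` gives pure Boltzmann sampling.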

### 2. Rattling the atomic positions and unit cell

To create a "child" from this parent structure, we perform the following transformation:

$$\begin{aligned}
R^\prime &\leftarrow [(A + I) \times R] + B \\
C^\prime &\leftarrow (A + I) \times C_0
\end{aligned}$$

where 
- $R$ are the atomic positions
- $C_0$ is the unit cell of the original seed structure
- $A \in \mathbb{R}^{3\times 3}$ has entries sampled from $\mathcal{N}(0, \sigma_{A})$ where $\sigma_{A} \in [0, \rm{cell \\_ sigma}]$
- $B \in \mathbb{R}^{N \times 3}$ has entries sampled from $\mathcal{N}(0, \sigma_{B})$ where $\sigma_{B} \in \rm{sigma \\_ range}$

In the case of isolated structures, we only rattle the positions (i.e. $A = 0^{3 \times 3}$).
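
The transformation can be sketched in plain Python as follows. This is an illustrative reading of the equations, not the package's code: the helper names `matvec` and `rattle` are hypothetical, positions and lattice vectors are assumed to be stored as rows, and $(A + I)$ is applied to each row vector in turn. How $\sigma_A$ and $\sigma_B$ are drawn from the config is left to the caller.

```python
import random


def matvec(m, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]


def rattle(positions, cell0, sigma_a, sigma_b, rng):
    """Return (R', C') for positions R and seed cell C_0."""
    # A + I: the identity plus Gaussian noise of width sigma_a.
    strain = [
        [(1.0 if i == j else 0.0) + rng.gauss(0.0, sigma_a) for j in range(3)]
        for i in range(3)
    ]
    # R' = (A + I) R + B, with each entry of B drawn from N(0, sigma_b).
    new_positions = [
        [x + rng.gauss(0.0, sigma_b) for x in matvec(strain, r)]
        for r in positions
    ]
    # C' = (A + I) C_0, using the same strain as for the positions.
    new_cell = [matvec(strain, c) for c in cell0]
    return new_positions, new_cell


positions = [[0.0, 0.0, 0.0], [1.0, 2.0, 3.0]]
cell0 = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 4.0]]
new_positions, new_cell = rattle(
    positions, cell0, sigma_a=0.01, sigma_b=0.05, rng=random.Random(42)
)
```

Passing `sigma_a=0.0` gives $A = 0$, which corresponds to the isolated-structure case where only the positions are rattled.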

### 3. Relaxing the rattled child structure

To relax the rattled child structure, we use energies and forces generated by the PES model using a scheme inspired by the [Robbins-Monro algorithm](https://en.wikipedia.org/wiki/Stochastic_approximation).

Step $x$ of this relaxation involves updating the atomic positions according to:

$$R^\prime \leftarrow R + \frac{\sigma_B}{x} \cdot \frac{F}{||F||}$$

where $F/||F||$ denotes the atomic force vectors, each normalised to unit length.
We perform up to $M$ relaxation steps (`config.max_relax_steps`). Provided the maximum force magnitude is less than `config.max_force`, we stop early with probability $\min(0.25, e^{-\Delta E / kT})$, where $\Delta E$ is the energy difference between the relaxed child and its starting parent structure.
We reject all final structures that have any pair of atoms closer than `config.min_separation` Å.
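
The relaxation loop and the final separation check can be sketched as below. This is a rough illustration under several assumptions rather than the package's implementation: the names `relax`, `too_close` and `toy_pes` are hypothetical, the early-stopping test is assumed to run before each position update, the separation check ignores periodic images, and a harmonic toy PES stands in for a real model.

```python
import math
import random

K_B = 8.617333262e-5  # Boltzmann constant in eV/K


def relax(positions, energy_and_forces, parent_energy, sigma_b, T,
          max_force, max_steps, rng):
    """Take up to max_steps steps of size sigma_b / x along the unit forces."""
    for x in range(1, max_steps + 1):
        energy, forces = energy_and_forces(positions)
        norms = [math.sqrt(sum(f * f for f in force)) for force in forces]
        if max(norms) < max_force:
            delta_e = energy - parent_energy
            # exp(-dE/kT) >= 1 when dE <= 0, so the cap of 0.25 applies;
            # branching also avoids overflow for very negative dE.
            p_stop = 0.25 if delta_e <= 0 else min(
                0.25, math.exp(-delta_e / (K_B * T))
            )
            if rng.random() < p_stop:
                break
        step = sigma_b / x
        # Move each atom along its own unit force vector (skip zero forces).
        positions = [
            [r + step * f / n if n > 0 else r for r, f in zip(pos, force)]
            for pos, force, n in zip(positions, forces, norms)
        ]
    return positions


def too_close(positions, min_separation):
    """True if any pair of atoms is closer than min_separation."""
    return any(
        math.dist(a, b) < min_separation
        for i, a in enumerate(positions)
        for b in positions[i + 1:]
    )


SITES = [[0.0, 0.0, 0.0], [0.0, 0.0, 1.5]]  # toy equilibrium positions


def toy_pes(positions, k=5.0):
    """Harmonic toy model: each atom is tethered to its own site."""
    disp = [[r - s for r, s in zip(pos, site)]
            for pos, site in zip(positions, SITES)]
    energy = 0.5 * k * sum(d * d for row in disp for d in row)
    forces = [[-k * d for d in row] for row in disp]
    return energy, forces


start = [[0.05, 0.0, 0.0], [0.0, 0.0, 1.45]]
relaxed = relax(start, toy_pes, parent_energy=0.0, sigma_b=0.05, T=300,
                max_force=30, max_steps=20, rng=random.Random(0))
accepted = not too_close(relaxed, min_separation=0.5)
```

Here `max_steps` plays the role of $M$, and the $1/x$ decay of the step size means the total displacement of any atom stays bounded by `sigma_b` times a slowly growing harmonic sum.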

## Demo

> This demo uses structures and a model taken from this repo's sister repository, found [here](https://github.com/dft-dutoit/synthetic-distillation).

We include a stand-alone demo in the `demo` directory. This takes 3 water structures as input and uses a PaiNN model to generate and label 27 new structures, for a total of 30 structures.

The `demo` directory has the following files:
- `input.xyz` contains 3 starting water structures
- `config.yaml` contains the configuration for the demo
- `model.pt` is a PaiNN model trained on water structures from ...
- `output.xyz` is the augmented dataset output.

To run this demo yourself:

```bash
# clone the repository
git clone https://github.com/jla-gardner/augment-atoms.git
cd augment-atoms/demo
# remove the output file if it exists
rm -rf output.xyz
# run the demo
augment-atoms config.yaml
```

This entire script took under 10 seconds on my M1 MacBook Pro.

## Citation

If you use `augment-atoms` in your research, please cite the following pre-print:

```bibtex
@misc{Gardner-25-06,
  title = {Distillation of Atomistic Foundation Models across Architectures and Chemical Domains},
  author = {Gardner, John L. A. and du Toit, Daniel F. Thomas and Mahmoud, Chiheb Ben and Beaulieu, Zo{\'e} Faure and Juraskova, Veronika and Pa{\c s}ca, Laura-Bianca and Rosset, Louise A. M. and Duarte, Fernanda and Martelli, Fausto and Pickard, Chris J. and Deringer, Volker L.},
  year = {2025},
  number = {arXiv:2506.10956},
  doi = {10.48550/arXiv.2506.10956},
}
```