Metadata-Version: 2.4
Name: duvidnn
Version: 0.0.1
Summary: Deep learning with uncertainty for general purpose, chemistry, and taxonomic tasks.
Author-email: Eachan Johnson <eachan.johnson@crick.ac.uk>
License: MIT License
        
        Copyright (c) [year] [fullname]
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
Project-URL: Homepage, https://github.com/scbirlab/duvidnn
Project-URL: Bug Tracker, https://github.com/scbirlab/duvidnn/issues
Keywords: ai,active-learning,bayesian-optimization,data,deep-learning,machine-learning
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3 :: Only
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: carabiner-tools[mpl,pd]>=0.0.4
Requires-Dist: datasets<4,>=3.0.0
Requires-Dist: duvida[torch]>=0.0.3.post1
Requires-Dist: huggingface_hub
Requires-Dist: numpy
Requires-Dist: platformdirs>=4.4.0
Requires-Dist: torch<=2.9,>=2.4
Requires-Dist: torchvision
Requires-Dist: lightning
Requires-Dist: scipy
Requires-Dist: tensorboard
Requires-Dist: tqdm
Provides-Extra: transformers
Requires-Dist: torch>=2.4; extra == "transformers"
Requires-Dist: torchvision; extra == "transformers"
Requires-Dist: transformers; extra == "transformers"
Provides-Extra: chem
Requires-Dist: chemprop>=2.1.0; extra == "chem"
Requires-Dist: schemist>=0.0.4.post1; extra == "chem"
Requires-Dist: scikit-learn>=1.7; extra == "chem"
Provides-Extra: bio
Requires-Dist: vectome>=0.0.4; extra == "bio"
Provides-Extra: splits
Requires-Dist: umap-learn; extra == "splits"
Requires-Dist: scipy; extra == "splits"
Requires-Dist: schemist>=0.0.4.post1; extra == "splits"
Requires-Dist: tdigest>=0.5; extra == "splits"
Requires-Dist: faiss-cpu>=1.11; extra == "splits"
Provides-Extra: splits-gpu
Requires-Dist: umap-learn; extra == "splits-gpu"
Requires-Dist: scipy; extra == "splits-gpu"
Requires-Dist: schemist>=0.0.4.post1; extra == "splits-gpu"
Requires-Dist: tdigest>=0.5; extra == "splits-gpu"
Requires-Dist: faiss-gpu; extra == "splits-gpu"
Provides-Extra: dev
Requires-Dist: duvidnn[bio,chem,splits,transformers]; extra == "dev"
Requires-Dist: pytest; extra == "dev"
Requires-Dist: flake8; extra == "dev"
Dynamic: license-file

# 🧐 duvidnn

![GitHub Workflow Status (with branch)](https://img.shields.io/github/actions/workflow/status/scbirlab/duvidnn/python-publish.yml)
![PyPI - Python Version](https://img.shields.io/pypi/pyversions/duvidnn)
![PyPI](https://img.shields.io/pypi/v/duvidnn)

**duvidnn** is a suite of Python tools for calculating confidence and information metrics 
for deep learning. It provides a higher-level framework for applying these metrics
to general purpose, taxonomic, and chemistry-specific neural networks. 

As a bonus, **duvidnn** also provides an easy command-line interface for training and testing models.

- [Installation](#installation)
- [Command-line interface](#command-line-interface)
- [Python API](#python-api)
    - [Neural networks](#neural-networks)
    - [More advanced API](#more-advanced-python-api-implementing-a-new-modelbox)
- [Issues, problems, suggestions](#issues-problems-suggestions)
- [Documentation](#documentation)

## Installation

### The easy way

You can install the precompiled version directly using `pip`.

```bash
$ pip install duvidnn
```

If you want to use duvidnn for chemistry machine learning and AI, use:

```bash
$ pip install duvidnn[chem]
```

For integrating taxonomic information with [vectome](https://github.com/scbirlab/vectome), use:

```bash
$ pip install duvidnn[bio]
```

You can install both:

```bash
$ pip install duvidnn[bio,chem]
```

### From source

Clone the repository, then `cd` into it. Then run:

```bash
$ pip install -e .
```

## Command-line interface

**duvidnn** has a command-line interface for training and checkpointing the built-in models. 

```bash
$ duvidnn --help
usage: duvidnn [-h] [--version] {hyperprep,train,predict,split,percentiles} ...

Calculating exact and approximate confidence and information metrics for deep learning on general purpose and chemistry tasks.

options:
  -h, --help            show this help message and exit
  --version, -v         show program's version number and exit

Sub-commands:
  {hyperprep,train,predict,split,percentiles}
                        Use these commands to specify the tool you want to use.
    hyperprep           Prepare inputs for hyperparameter search.
    train               Train a PyTorch model.
    predict             Make predictions and calculate uncertainty using a duvidnn checkpoint.
    split               Make chemical train-test-val splits on out-of-core datasets.
    percentiles         Add columns indicating whether rows are in a percentile.
```

In all cases, you can get further options with `duvidnn <command> --help`, for example:

```bash
$ duvidnn train --help
```

### Annotating top percentiles

You can add columns to datasets which annotate the top percentiles of named columns. This is compatible
with extremely large datasets that don't fit in memory.

```bash
$ duvidnn percentiles \
    hf://scbirlab/fang-2023-biogen-adme@scaffold-split:train \
    --columns clogp tpsa \
    --percentiles 1 5 10 \
    --output percentiles.parquet \
    --plot percentiles-plot.png \
    --structure smiles
```

In all cases, input data can be:
- Path to a _local_ file in CSV, Parquet, Arrow or HF Dataset format
- _or_ a remote dataset hosted on [🤗 Datasets](https://huggingface.co/datasets), 
indicated by `hf://` followed by the repository name

### Data splitting

There are utilities for out-of-core scaffold splitting and approximate (FAISS-based) spectral splitting of datasets
that don't fit in memory. Pass `--seed` to make the split random but reproducible; otherwise a deterministic bin-packing
algorithm is used.

```bash
$ duvidnn split hf://scbirlab/fang-2023-biogen-adme@scaffold-split:train \
    --train .7 \
    --validation .15 \
    --structure smiles \
    --type faiss \
    --seed 1 \
    --output faiss.csv \
    --plot faiss.png
```
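
One way to picture a deterministic bin-packing split is the greedy sketch below. It assumes each row already carries a group label (for example a scaffold), keeps every group inside a single split, and always fills the split that is furthest below its target size. The function and its tie-breaking rule are illustrative assumptions, not the algorithm in the **duvidnn** source.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def greedy_group_split(
    groups: Sequence[Hashable],
    fractions: Dict[str, float],
) -> List[str]:
    """Assign each row to a split so that no group straddles two splits."""
    members = defaultdict(list)
    for row, group in enumerate(groups):
        members[group].append(row)

    targets = {name: frac * len(groups) for name, frac in fractions.items()}
    counts = {name: 0 for name in fractions}
    assignment = [""] * len(groups)

    # Largest groups first; ties are broken by first appearance, so the result is deterministic
    ordered = sorted(members.items(), key=lambda kv: (-len(kv[1]), kv[1][0]))
    for _, rows in ordered:
        # Pick the split with the largest remaining deficit
        best = max(counts, key=lambda name: targets[name] - counts[name])
        for row in rows:
            assignment[row] = best
        counts[best] += len(rows)
    return assignment


scaffolds = ["a"] * 6 + ["b"] * 2 + ["c", "d"]
splits = greedy_group_split(
    scaffolds,
    {"train": 0.7, "validation": 0.15, "test": 0.15},
)
```

With small datasets the realized fractions only approximate the targets, because whole groups are moved at once.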

### Model training and evaluation

To train:

```bash
$ duvidnn train -1 hf://scbirlab/fang-2023-biogen-adme@scaffold-split:train \
    -2 hf://scbirlab/fang-2023-biogen-adme@scaffold-split:test \
    --class fingerprint \
    --structure smiles \
    --ensemble-size 10 \
    --epochs 10 \
    --learning-rate 0.001 \
    --output model.dv
```

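Different model classes can be specified with `--class`. The registered names (such as `fingerprint`, `chemprop`, and `mlp`) are listed under [Neural networks](#neural-networks).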


### Hyperparameters

There is also a simple hyperparameter utility.

```bash
$ printf '{"model_class": "fingerprint", "use_2d": [true, false], "n_units": 16, "n_hidden": 3}' | duvidnn hyperprep -o hyperopt.json
```

This generates a file containing all combinations. It can be indexed (0-based) 
with the `-i <int>` option to supply a specific training configuration like so:

```bash
$ duvidnn train \
    -1 hf://scbirlab/fang-2023-biogen-adme@scaffold-split:train \
    -2 hf://scbirlab/fang-2023-biogen-adme@scaffold-split:test \
    -c hyperopt.json \
    -i 0 \
    --output model.dv
```

In this way, you can generate all the hyperparameter combinations, then systematically test them one by one 
(or in parallel using HPC or other methods).
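
To make "all combinations" concrete, here is a minimal sketch of a grid expansion, assuming that list-valued entries are treated as options and scalar entries are held fixed. The `expand_grid` helper is hypothetical, and the ordering of combinations in the file written by `duvidnn hyperprep` may differ.

```python
import json
from itertools import product
from typing import Any, Dict, List


def expand_grid(config: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Expand list-valued entries into the Cartesian product of configurations."""
    keys = list(config)
    # Wrap scalars so that every key contributes at least one option
    options = [v if isinstance(v, list) else [v] for v in config.values()]
    return [dict(zip(keys, combo)) for combo in product(*options)]


raw = '{"model_class": "fingerprint", "use_2d": [true, false], "n_units": 16, "n_hidden": 3}'
grid = expand_grid(json.loads(raw))

print(len(grid))  # 2
print(grid[0])    # the configuration a 0-based index of 0 would select in this sketch
```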

### Predictions

You can make predictions on datasets using `duvidnn predict`. Optionally, you can restrict prediction to only a chunk of the
dataset using `--start` and `--stop`. This can be useful to parallelize prediction across chunks.
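
When splitting a prediction job into chunks, the row offsets can be generated ahead of time. The helper below is a hypothetical sketch; it assumes the upper bound of each chunk is exclusive, which should be checked against `duvidnn predict --help`.

```python
from typing import Iterator, Tuple


def chunk_bounds(n_rows: int, chunk_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, stop) row offsets covering n_rows, with stop exclusive."""
    for start in range(0, n_rows, chunk_size):
        yield start, min(start + chunk_size, n_rows)


bounds = list(chunk_bounds(n_rows=250, chunk_size=100))
print(bounds)  # [(0, 100), (100, 200), (200, 250)]
```

Each pair can then be passed to a separate prediction job.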

When predicting, there is also the option to calculate uncertainty metrics like ensemble variance (`--variance`), 
Tanimoto nearest neighbor distance to training set (`--tanimoto`, for chemistry models), doubtscore (`--doubtscore`), 
and information sensitivity (`--information-sensitivity`). 

```bash
$ duvidnn predict \
    --test hf://scbirlab/fang-2023-biogen-adme@scaffold-split:test \
    --checkpoint model.dv \
    --start 100 \
    --end 200 \
    --variance \
    --tanimoto \
    --doubtscore \
    -y clogp \
    --output predictions.parquet
```
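
For intuition about the Tanimoto nearest neighbor distance, the sketch below computes it for fingerprints represented as sets of "on" bit indices. This is a plain-Python illustration with made-up fingerprints; how **duvidnn** featurizes structures and searches the training set is not shown here.

```python
from typing import Iterable, Set


def tanimoto_similarity(a: Set[int], b: Set[int]) -> float:
    """Shared bits divided by total distinct bits."""
    union = len(a | b)
    # Two empty fingerprints are treated as identical (a convention chosen for this sketch)
    return len(a & b) / union if union else 1.0


def nearest_neighbor_distance(query: Set[int], training: Iterable[Set[int]]) -> float:
    """Distance (1 - similarity) from the query to its closest training example."""
    return min(1.0 - tanimoto_similarity(query, t) for t in training)


# Toy fingerprints, not derived from real molecules
train_fps = [{1, 2, 3, 4}, {10, 11, 12}]
query_fp = {1, 2, 3, 5}
distance = nearest_neighbor_distance(query_fp, train_fps)
```

A distance near 0 means the query closely resembles something in the training set, while a distance near 1 means it shares almost no features with any training example.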

Outputs can be made in CSV, Parquet, Arrow, or HF Dataset format. This is inferred from the file extension of
the filename provided for `--output`. 

Note that information sensitivity using default parameters can be very slow for large models with large
training data, since it must calculate second-order parameter gradients for every training example. There are 
approximations which can speed it up substantially, at the cost of exactness:
- The `--last-layer` option gives the biggest speed-up, since it restricts the calculation to only the output layer of the model. 
- Using `--optimality` assumes the model has been trained to an optimum (i.e. gradient of loss is zero). 
- The `--approx bekas` option uses a fast approximation of second-order gradients.
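
As background for the last option, the sketch below shows a stochastic diagonal estimator in the style of Bekas et al., which recovers the diagonal of a matrix from matrix-vector products alone. It assumes `bekas` refers to this family of estimators. The explicit 2x2 matrix stands in for a Hessian, and all function names are illustrative.

```python
import random
from typing import Callable, List, Sequence

Vector = List[float]


def rademacher_probes(n_probes: int, dim: int, seed: int = 0) -> List[Vector]:
    """Random probe vectors with entries drawn uniformly from {-1, +1}."""
    rng = random.Random(seed)
    return [[rng.choice((-1.0, 1.0)) for _ in range(dim)] for _ in range(n_probes)]


def bekas_diagonal(matvec: Callable[[Vector], Vector], probes: Sequence[Vector]) -> Vector:
    """Estimate diag(H) as sum(v * Hv) / sum(v * v), elementwise over probes v."""
    dim = len(probes[0])
    numerator = [0.0] * dim
    denominator = [0.0] * dim
    for v in probes:
        hv = matvec(v)  # only matrix-vector products are needed, never H itself
        for i in range(dim):
            numerator[i] += v[i] * hv[i]
            denominator[i] += v[i] * v[i]
    return [num / den for num, den in zip(numerator, denominator)]


H = [[2.0, 1.0], [1.0, 3.0]]  # toy symmetric matrix standing in for a Hessian


def matvec(v: Vector) -> Vector:
    return [sum(h_ij * v_j for h_ij, v_j in zip(row, v)) for row in H]


estimate = bekas_diagonal(matvec, rademacher_probes(n_probes=10, dim=2))
```

More probes reduce the error contributed by off-diagonal entries. In a deep learning setting the matrix-vector product would come from automatic differentiation rather than an explicit matrix.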

## Python API

**duvidnn** provides python classes and functions for custom analysis.

### Neural networks

The core of **duvidnn** is the `ModelBox`, which is a container for a trainable model and its training data.
These are connected because measures of confidence and information gain depend directly on the information
or evidence already seen by the model.

There are several `ModelBox` classes for specific deep learning architectures in PyTorch. 

```python
>>> from duvidnn.autoclass import MODELBOX_REGISTRY
>>> from pprint import pprint
>>> pprint(MODELBOX_REGISTRY)
{'bilinear': <class 'duvidnn.torch.modelbox.modelboxes.TorchBilinearModelBox'>,
 'bilinear-fp': <class 'duvidnn.torch.modelbox.modelboxes.TorchBilinearFingerprintModelBox'>,
 'chemprop': <class 'duvidnn.torch.modelbox.modelboxes.ChempropModelBox'>,
 'cnn': <class 'duvidnn.torch.modelbox.modelboxes.TorchCNN2DModelBox'>,
 'fingerprint': <class 'duvidnn.torch.modelbox.modelboxes.TorchFingerprintModelBox'>,
 'mlp': <class 'duvidnn.torch.modelbox.modelboxes.TorchMLPModelBox'>}
```

The modelboxes `chemprop`, `fingerprint`, and `bilinear-fp` featurize SMILES representations of chemical 
structures. The modelbox `mlp` is a general purpose multilayer perceptron.

You can set up your model with various training parameters.

```python
from duvidnn.autoclass import AutoClass
modelbox = AutoClass(
    "fingerprint",
    n_units=16,
    n_hidden=2,
    ensemble_size=10,
    structure_column="smiles",
)
```

The internal neural network is instantiated on loading training data.

```python
modelbox.load_training_data(
    data="hf://scbirlab/fang-2023-biogen-adme@scaffold-split:train",
    inputs="smiles", # column name of the predictor values
    labels="clogp",  # column name of the values to predict
)
```

The `data` can be a remote 🤗 dataset, in which case it is automatically downloaded. The `"@"`
indicates the dataset configuration, and the `":"` indicates the specific data split.
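
The reference format can be illustrated with a small parser. This is a hypothetical helper written for this README, not the parsing code used inside **duvidnn**, and it assumes that both the configuration and the split are optional.

```python
from typing import NamedTuple, Optional


class HFDatasetRef(NamedTuple):
    repo: str
    config: Optional[str]
    split: Optional[str]


def parse_hf_ref(ref: str) -> HFDatasetRef:
    """Split 'hf://<repo>@<config>:<split>' into its three parts."""
    prefix = "hf://"
    if not ref.startswith(prefix):
        raise ValueError(f"Not a remote dataset reference: {ref!r}")
    rest = ref[len(prefix):]
    rest, _, split = rest.partition(":")   # text after ':' is the data split
    repo, _, config = rest.partition("@")  # text after '@' is the configuration
    return HFDatasetRef(repo, config or None, split or None)


parsed = parse_hf_ref("hf://scbirlab/fang-2023-biogen-adme@scaffold-split:train")
print(parsed.repo, parsed.config, parsed.split)
```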

Alternatively, the training data can be a local CSV or TSV file, or in-memory Pandas dataframes 
or dictionaries.

With training data loaded, the model can be trained!

```python
modelbox.train(
    val_filename="hf://scbirlab/fang-2023-biogen-adme@scaffold-split:test",
    epochs=10,
    batch_size=128,
)
```

The `ModelBox.train()` method uses PyTorch Lightning under the hood, so other Lightning options,
such as callbacks, should be accepted.

#### Saving and sharing a trained model

**duvidnn** provides a basic checkpointing mechanism that saves model weights and training data so they can be reloaded later.

```python
modelbox.save_checkpoint("checkpoint.dv")
modelbox.load_checkpoint("checkpoint.dv")
```

#### Evaluating and predicting on new data

**duvidnn** `ModelBox`es provide methods for evaluating predictions on new data.

```python
predictions, metrics = modelbox.evaluate(
    data="hf://scbirlab/fang-2023-biogen-adme@scaffold-split:test",
)
```

#### Calculating uncertainty and information metrics

**duvidnn** `ModelBox`es provide methods for calculating prediction variance of ensembles,
doubtscore, and information sensitivity.

```python
doubtscore = modelbox.doubtscore(
    data="hf://scbirlab/fang-2023-biogen-adme@scaffold-split:test"
)
info_sens = modelbox.information_sensitivity(
    data="hf://scbirlab/fang-2023-biogen-adme@scaffold-split:test",
    approx="bekas",  # approximate Hessian diagonals
    n=10,
)
```
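
The first of these metrics, ensemble prediction variance, can be pictured with the standard library alone. The sketch below assumes the population variance across ensemble members; whether **duvidnn** uses the population or sample variance is not stated here, and the function name is illustrative.

```python
from statistics import mean, pvariance
from typing import List, Sequence, Tuple


def ensemble_summary(
    member_predictions: Sequence[Sequence[float]],
) -> List[Tuple[float, float]]:
    """Return (mean, variance) per example.

    member_predictions[m][i] is the prediction of ensemble member m for example i.
    """
    per_example = zip(*member_predictions)  # regroup predictions by example
    return [(mean(p), pvariance(p)) for p in per_example]


# Three ensemble members, two examples (toy numbers)
preds = [[1.0, 2.0], [1.0, 4.0], [1.0, 6.0]]
summary = ensemble_summary(preds)
```

Here the members agree on the first example (zero variance) and disagree on the second, which is the signal an ensemble uses to express uncertainty.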

To avoid storing large datasets in memory, **duvidnn** uses [🤗 datasets](https://huggingface.co/docs/datasets/) under the hood
to cache data. Results can be instantiated in memory with a little effort. For example:

```python
doubtscore = doubtscore.to_pandas()
```

See the [🤗 datasets documentation](https://huggingface.co/docs/datasets/) for more.

## More advanced Python API: Implementing a new `ModelBox`

Bringing a new PyTorch model to **duvidnn** is relatively straightforward. First, write your model,
adding the Lightning logic through `LightningMixin`:

```python
from typing import Callable, Type

from duvidnn.torch.models.utils.lt import LightningMixin
from torch import nn
from torch.optim import Adam, Optimizer

class SimpleMLP(nn.Module, LightningMixin):

    def __init__(
        self,
        n_input: int,
        n_units: int = 16,
        n_out: int = 1,
        activation: Callable[[], nn.Module] = nn.SiLU,  # Smooth activation to prevent vanishing gradient
        learning_rate: float = .01,
        optimizer: Type[Optimizer] = Adam,  # the optimizer class, not an instance
        *args, **kwargs
    ):
        super().__init__(*args, **kwargs)
        self.n_input = n_input
        self.n_units = n_units
        self.activation = activation
        self.n_out = n_out
        # nn.Sequential takes modules as positional arguments, not a list
        self.model_layers = nn.Sequential(
            nn.Linear(self.n_input, self.n_units),
            self.activation(),
            nn.Linear(self.n_units, self.n_out),
        )
        # Lightning logic
        self._init_lightning(
            optimizer=optimizer, 
            learning_rate=learning_rate, 
            model_attr='model_layers',  # the attribute containing the model
        )

    def forward(self, x):
        return self.model_layers(x)
```

Then subclass `duvidnn.torch.modelbox.TorchModelBoxBase` and implement the `create_model()` method, which should
simply return your instantiated model. If you want to preprocess input data on the fly, then
add a `preprocess_data()` method which takes a data dictionary and returns a data dictionary.

```python
from typing import Dict

from duvidnn.torch.modelbox import TorchModelBoxBase
import numpy as np

class MLPModelBox(TorchModelBoxBase):
    
    def __init__(self, *args, **kwargs):
        super().__init__()
        self._mlp_kwargs = kwargs

    def create_model(self, *args, **kwargs):
        self._model_config.update(kwargs)  # makes sure model checkpointing saves the keyword args
        return SimpleMLP(
            n_input=self.input_shape[-1],  # defined on data loading
            n_out=self.output_shape[-1], 
            *args, 
            **self._model_config,
            **self._mlp_kwargs,  # if init kwargs are relevant to model creation
        )

    # Define this method if your data needs preprocessing.
    # `your_featurizer` is a placeholder for your own featurization function.
    @staticmethod
    def preprocess_data(data: Dict[str, np.ndarray], _in_key, _out_key, **kwargs) -> Dict[str, np.ndarray]:
        return {
            _in_key: your_featurizer(data[_in_key]), 
            _out_key: np.asarray(data[_out_key])
        }
```
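
To show the dictionary-in, dictionary-out contract of `preprocess_data()` in isolation, here is a self-contained sketch with a toy stand-in for `your_featurizer`. The hashed character n-gram featurizer is purely illustrative and not part of **duvidnn**. It returns plain lists; inside a `ModelBox` you would likely wrap the results in `np.asarray`, as in the example above.

```python
import zlib
from typing import Dict, List, Sequence


def your_featurizer(texts: Sequence[str], n_bits: int = 32, ngram: int = 2) -> List[List[float]]:
    """Hash character n-grams of each string into a fixed-length count vector."""
    features = []
    for text in texts:
        vector = [0.0] * n_bits
        for i in range(len(text) - ngram + 1):
            # crc32 is stable across runs, unlike the built-in hash() for strings
            bucket = zlib.crc32(text[i:i + ngram].encode("utf-8")) % n_bits
            vector[bucket] += 1.0
        features.append(vector)
    return features


def preprocess_data(data: Dict[str, list], _in_key: str, _out_key: str, **kwargs) -> Dict[str, list]:
    """Featurize the input column and pass the label column through as floats."""
    return {
        _in_key: your_featurizer(data[_in_key]),
        _out_key: [float(y) for y in data[_out_key]],
    }


# Toy batch; the label values are made up for illustration
batch = {"smiles": ["CCO", "c1ccccc1"], "clogp": [-0.1, 2.0]}
processed = preprocess_data(batch, "smiles", "clogp")
```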

If you want to build `ModelBox`es based on a framework other than PyTorch, you can subclass 
the `duvidnn.base.ModelBoxBase` abstract class, making sure to implement its abstract methods.

## Issues, problems, suggestions

Add to the [issue tracker](https://www.github.com/scbirlab/duvidnn/issues).

## Documentation

(To come at [ReadTheDocs](https://duvidnn.readthedocs.org).)
