Metadata-Version: 2.1
Name: macaw-py
Version: 1.0.1
Summary: MACAW molecular embedder and generator
Home-page: https://github.com/LBLQMM/macaw
Author: Vincent Blay
Author-email: vblayroger@lbl.gov
License: UNKNOWN
Keywords: cheminformatics,molecular design,drug discovery,synthetic biology
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Operating System :: OS Independent
Classifier: Topic :: Scientific/Engineering :: Chemistry
Classifier: Intended Audience :: Science/Research
Classifier: Natural Language :: English
Description-Content-Type: text/markdown
Requires-Dist: numpy (>=1.8.0)
Requires-Dist: scikit-learn (>=0.24.1)
Requires-Dist: scipy (>=1.6.1)
Requires-Dist: selfies (>=2.0.0)
Requires-Dist: umap-learn (>=0.5.2)

# MACAW

**MACAW** (Molecular AutoenCoding Auto-Workaround) is a cheminformatic tool for Python that embeds molecules in a low-dimensional, continuous numerical space. It also enables the generation of new molecules that meet a given specification.

MACAW embeddings are molecular features that can be used as inputs in mathematical and machine-learning models, as an alternative to conventional molecular descriptors. The embeddings are fast and easy to compute, variable selection is not needed, and they may enable more accurate predictive models than conventional molecular descriptors.

MACAW also provides original algorithms to generate molecular libraries and to evolve molecules *in silico* that meet a desired specification (inverse molecular design). The design specification can be any property or combination of properties that can be predicted for the molecule, such as its octane number or its binding affinity to a protein.

Details about the different algorithms are explained in the [MACAW publication](https://doi.org/10.26434/chemrxiv-2022-x647j).


## Installation

MACAW requires rdkit 2020.09.4 or later to run, which can be installed using [conda](https://anaconda.org/conda-forge/rdkit):

```bash
conda install -c conda-forge rdkit
```

Alternative methods to install rdkit are given [here](https://www.rdkit.org/docs/Install.html).


Then run the following command to install MACAW:

```bash
pip install macaw_py
```

### Documentation

Read the documentation on [Read the Docs](https://macaw.readthedocs.io/en/latest/).

## Usage

The following illustrates some of the main commands in MACAW. Detailed usage examples with real datasets are available as Jupyter Notebooks in the [MACAW repository](https://github.com/LBLQMM/macaw).


### Molecule embedding

Given a list of molecules represented as SMILES strings (`smiles`), their MACAW embeddings (`X`) can be obtained as follows:

```python
from macaw import *

mcw = MACAW()
mcw.fit(smiles)
X = mcw.transform(smiles)
```

Any list of molecules in SMILES format (`qsmiles`) can be embedded using an existing MACAW object:

```python
X_new = mcw.transform(qsmiles)
```

The embedder has a variety of parameters that can be tuned to improve results. These include the dimensionality of the embedding (`n_components`), the number of landmarks used (`n_landmarks`), the type of molecular fingerprint (`type_fp`), and the similarity metric (`metric`). Property values (`y_values`) can also be passed to the argument `Y` to improve landmark choice. Other arguments and options are explained in the documentation.

```python
mcw = MACAW(n_components=20, type_fp='rdk5', metric='Dice', n_landmarks=60)

# Fit the embedder and keep the resulting embeddings
X = mcw.fit_transform(smiles, Y=y_values)
```
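
To give a feel for what the `metric` and `n_landmarks` options refer to, the following is a minimal conceptual sketch, not MACAW's actual implementation. It assumes that fingerprints can be treated as sets of on-bit indices and that a molecule is described by its similarity to each landmark molecule. The function names and the toy fingerprints are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def dice(a, b):
    """Dice similarity between two sets of on-bit indices."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 1.0


def landmark_similarities(query_fp, landmark_fps, metric=dice):
    """Describe a query fingerprint by its similarity to each landmark."""
    return [metric(query_fp, lm) for lm in landmark_fps]


# Toy fingerprints (hypothetical): each set lists the bits that are "on".
landmarks = [{1, 2, 3, 4}, {3, 4, 5}, {7, 8}]
query = {2, 3, 4}

print(landmark_similarities(query, landmarks, metric=tanimoto))
print(landmark_similarities(query, landmarks, metric=dice))
```

In this picture, more landmarks give a longer similarity vector per molecule, and the choice of metric changes how shared bits are weighted.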

The function `MACAW_optimus` automatically explores a variety of combinations of fingerprint type (`type_fp`) and similarity metric (`metric`) and returns a recommended embedder ready for use:

```python
mcw = MACAW_optimus(smiles, n_components=20, y=y_values, verbose=True)
```

### Molecule generation

Given an input dataset of molecules in SMILES format, MACAW's `library_maker` function will generate a library of molecules around it. The maximum number of molecules to generate is specified with the `n_gen` parameter, while the spread of the distribution can be controlled with the `noise_factor` argument. Additional parameters are explained in the function help.


```python
smiles_lib = library_maker(smiles, n_gen=50000, noise_factor=0.3)
```

### Molecule recommendation (inverse design)

Given a property of interest, a model `f` can be trained to predict the property values of different molecules. The model `f` takes as inputs the features generated by the embedder `mcw`.
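
As an illustration of what such a model could look like, here is a minimal sketch of a k-nearest-neighbor regressor that works on embedding vectors. It is a stand-in written with the standard library only; in practice any regressor (for example, one from scikit-learn) could play the role of `f`. The class name, the `fit`/`predict` interface, and the toy data are assumptions made for this example, so check the function help for the interface that MACAW actually expects.

```python
import math


class NearestNeighborModel:
    """Toy k-nearest-neighbor regressor over embedding vectors."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # Store the training embeddings and their property values.
        self.X_ = [list(row) for row in X]
        self.y_ = list(y)
        return self

    def predict(self, X):
        preds = []
        for row in X:
            # Indices of the k training points closest to this row.
            nearest = sorted(
                range(len(self.X_)),
                key=lambda i: math.dist(row, self.X_[i]),
            )[: self.k]
            # Predict the mean property value of those neighbors.
            preds.append(sum(self.y_[i] for i in nearest) / len(nearest))
        return preds


# Hypothetical 2-D embeddings and property values, for illustration only.
X_train = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
y_train = [1.0, 2.0, 3.0, 10.0]

f = NearestNeighborModel(k=2).fit(X_train, y_train)
print(f.predict([[0.1, 0.0]]))
```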

Molecules can then be evolved and recommended to satisfy a desired property specification value (`spec`) using the function `library_evolver`. It takes as input an initial set of molecules (`smiles`), the featurizer (`mcw`), the predictive model (`f`), the desired specification value (`spec`), the number of molecules to recommend (`n_hits`), and the number of evolution rounds (`n_rounds`). Other optional arguments are described in the function help.

```python
recommended_smiles = library_evolver(smiles, mcw, f, spec, n_hits=10, n_rounds=8)
```
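
The idea of recommending molecules against a specification can be sketched as a simple ranking step: score each candidate by how far its predicted property is from `spec` and keep the closest ones. This is only an illustration of the selection criterion, not the algorithm used by `library_evolver`. The helper `recommend_closest` and the dummy predictor are hypothetical.

```python
def recommend_closest(candidates, predict, spec, n_hits=10):
    """Return the n_hits candidates whose predicted value is closest to spec."""
    ranked = sorted(candidates, key=lambda c: abs(predict(c) - spec))
    return ranked[:n_hits]


# Dummy predictor (hypothetical): uses the SMILES length as a fake property.
def fake_property(smiles_string):
    return float(len(smiles_string))


candidates = ["C", "CC", "CCC", "CCCC", "CCCCC"]
hits = recommend_closest(candidates, fake_property, spec=3.2, n_hits=2)
print(hits)
```

In a real workflow, the predictor would embed each candidate with `mcw` and pass the result to the trained model `f`.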

## License

MACAW code is distributed under the license specified in the [`Noncommercial_Academic_LA.pdf`](https://github.com/LBLQMM/MACAW/blob/main/Noncommercial_Academic_LA.pdf) file. This license allows free **non-commercial** use for **academic institutions**. Modifications should be fed back to the original repository to benefit all users. 

Separate **evaluation** and **commercial use** licenses are available for businesses. Business users, please contact [LBNL Licensing](mailto:jhaemmerle@lbl.gov).


