Metadata-Version: 2.1
Name: protenc
Version: 0.1.0
Summary: Simplify extraction of protein embedding from various models.
Author: Kristian Klemon
Author-email: kristian.klemon@gmail.com
Requires-Python: >=3.8,<4.0
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Requires-Dist: biopython (>=1.80,<2.0)
Requires-Dist: lmdb (>=1.3.0,<2.0.0)
Requires-Dist: pandas (>=1.5.2,<2.0.0)
Requires-Dist: sentencepiece (>=0.1.97,<0.2.0)
Requires-Dist: torch (>=1.13.0,<2.0.0)
Requires-Dist: tqdm (>=4.64.1,<5.0.0)
Requires-Dist: transformers (>=4.24.0,<5.0.0)
Description-Content-Type: text/markdown

protenc
=======

protenc is a library to simplify extraction of protein embeddings from various pre-trained models, including:

* [ProtTrans](https://github.com/agemagician/ProtTrans) family
* [ESM](https://github.com/facebookresearch/esm)
* AlphaFold (coming soon™)

It provides a Python API as well as a flexible bulk extraction script designed to support a range of input and
output formats.

**Note:** This project is a work in progress.

Usage
-----

**Installation**

```bash
pip install protenc
```

**Python API**

```python
import protenc
import torch

# List available models
print(protenc.list_models())

# Instantiate a model
model = protenc.get_model('esm2_t33_650M_UR50D')

proteins = [
  'MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG',
  'KALTARQQEVFDLIRDHISQTGMPPTRAEIAQRLGFRSPNAAEEHLKALARKGVIEIVSGASRGIRLLQEE'
]

batch = model.prepare_sequences(proteins)

# Move to GPU if available
if torch.cuda.is_available():
  model = model.to('cuda')
  batch = protenc.utils.to_device(batch, 'cuda')

for embed in model(batch):
  # Embeddings have shape [L, D] where L is the sequence length and D the
  # embedding dimensionality.
  print(embed.shape)

  # Derive a single per-protein embedding vector of shape [D] by averaging
  # along the sequence dimension
  protein_embed = embed.mean(0)
```
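For larger collections of proteins, it can help to split the input into batches before passing each one to `model.prepare_sequences`. The following is a minimal sketch using only the standard library. The helper `iter_length_sorted_batches` is hypothetical and not part of protenc. It assumes that sequences within a batch are padded to a common length, as is typical for transformer models, so grouping sequences of similar length should reduce wasted computation.

```python
from typing import Iterator, List, Sequence, Tuple


def iter_length_sorted_batches(
    sequences: Sequence[str], batch_size: int
) -> Iterator[Tuple[List[int], List[str]]]:
    """Yield (original_indices, batch) pairs of similarly sized sequences.

    Hypothetical helper, not part of protenc.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    # Sort the indices by sequence length so that each batch contains
    # sequences of similar length.
    order = sorted(range(len(sequences)), key=lambda i: len(sequences[i]))

    for start in range(0, len(order), batch_size):
        indices = order[start:start + batch_size]
        yield indices, [sequences[i] for i in indices]


# Toy sequences for illustration only
toy_proteins = ["MKTVRQ", "MK", "MKTV", "MKT", "MKTVRQERL"]

for indices, batch_seqs in iter_length_sorted_batches(toy_proteins, batch_size=2):
    # With protenc installed, each chunk could be embedded as shown above:
    #   batch = model.prepare_sequences(batch_seqs)
    #   embeds = list(model(batch))
    print(indices, [len(s) for s in batch_seqs])
```

The original indices are returned alongside each batch so that results can be mapped back to the input order, assuming the model yields embeddings in the same order as the sequences in the batch.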

**Command-line interface**

Coming soon.

Development
-----------

Clone the repository:

```bash
git clone https://github.com/kklemon/protenc.git
```

Install dependencies via [Poetry](https://python-poetry.org/):

```bash
poetry install
```

Todo
----

- [ ] Support for more input formats
  - [X] CSV
  - [ ] Parquet
  - [ ] FASTA
  - [ ] JSON
- [ ] Support for more output formats
  - [X] LMDB
  - [ ] HDF5
  - [ ] DataFrame
  - [ ] Pickle
- [ ] Large models support
  - [ ] Model offloading
  - [ ] Sharding
- [ ] Support for more protein language models
  - [ ] Whole ProtTrans family
  - [ ] Whole ESM family
  - [ ] AlphaFold (?)
- [ ] Implement all remaining TODOs in code
- [ ] Distributed inference
- [ ] Maybe support some sort of optimized inference such as quantization
  - This may be up to the model providers
- [ ] Improve documentation
- [ ] Support translation of gene sequences

