Metadata-Version: 2.1
Name: proteinnetpy
Version: 1.0.1
Summary: Read, process and write ProteinNet data
Author-email: Alistair Dunham <ad44@sanger.ac.uk>
Maintainer-email: Alistair Dunham <ad44@sanger.ac.uk>
License:    Copyright 2020 EMBL - European Bioinformatics Institute
        
           Licensed under the Apache License, Version 2.0 (the "License");
           you may not use this file except in compliance with the License.
           You may obtain a copy of the License at
        
             http://www.apache.org/licenses/LICENSE-2.0
        
           Unless required by applicable law or agreed to in writing, software
           distributed under the License is distributed on an "AS IS" BASIS,
           WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
           See the License for the specific language governing permissions and
           limitations under the License.
        
        
Project-URL: Repository, https://github.com/allydunham/proteinnetpy
Project-URL: Documentation, https://proteinnetpy.readthedocs.io/en/latest/
Project-URL: Publication, https://doi.org/10.1186/s13059-023-02948-3
Keywords: protein,bioinformatics,proteinnet,machine learning
Requires-Python: >=3
Description-Content-Type: text/markdown
Provides-Extra: datasets
License-File: LICENSE

# ProteinNetPy 1.0.1
<!-- badges: start -->
[![DOI](https://zenodo.org/badge/267846791.svg)](https://zenodo.org/badge/latestdoi/267846791)
[![Documentation Status](https://readthedocs.org/projects/proteinnetpy/badge/?version=latest)](https://proteinnetpy.readthedocs.io/en/latest/?badge=latest)
<!-- badges: end -->

A Python library for working with [ProteinNet](https://github.com/aqlaboratory/proteinnet) text data, allowing you to easily load, stream and filter data, map functions across records and produce TensorFlow datasets.
For details of the dataset see the ProteinNet [Bioinformatics paper](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2932-0).
Documentation for all functions of the module is available [here](https://proteinnetpy.readthedocs.io/en/latest/).

## Install

`pip install proteinnetpy`

Or install the development version from GitHub:

`pip install git+https://github.com/allydunham/proteinnetpy`

## Requirements

* Python 3
* NumPy
* Biopython
* TensorFlow (if using the `datasets` module)

## Basic Usage

The main object in ProteinNetPy is the `ProteinNetRecord`, which provides access to the record fields and methods for common manipulations, such as calculating a one-hot sequence representation or residue distance matrix.
It also supports standard Python operations such as `len` and `str`.
While the `parser` module contains a generator to parse files, it is generally easier to use the `ProteinNetDataset` class from the `data` module:

```python
from proteinnetpy.data import ProteinNetDataset
data = ProteinNetDataset(path="path/to/proteinnet")
```

This class includes a `preload` argument, which determines whether the dataset is loaded into memory or streamed.
It also supports filtering through the `filter_func` argument, which takes a function that returns a truthy value for each record that should be kept in the dataset.
A range of common filters is included in the `data` module, as well as `combine_filters()`, which applies all passed filters to each record.
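
As a rough illustration of the filter pattern described above, the sketch below writes two predicates and combines them so that a record is kept only when both return a truthy value. It does not import ProteinNetPy: `StubRecord`, `length_between`, `no_unknown_residues` and `all_of` are hypothetical names made up for this example, and the `primary` attribute on the stub is an assumption about how a record might expose its sequence. Functions of this shape should be usable as `filter_func`, but check the docstrings for the library's own filters before writing new ones.

```python
from dataclasses import dataclass


@dataclass
class StubRecord:
    """Hypothetical stand-in for a record, holding only an ID and a sequence."""
    id: str
    primary: str  # assumed attribute name, used only within this sketch

    def __len__(self):
        return len(self.primary)


def length_between(min_length, max_length):
    """Build a filter keeping records whose length lies within the bounds."""
    def filter_func(record):
        return min_length <= len(record) <= max_length
    return filter_func


def no_unknown_residues(record):
    """Keep records whose sequence contains no 'X' residues."""
    return "X" not in record.primary


def all_of(*filters):
    """Combine filters: a record is kept only if every filter is truthy."""
    def filter_func(record):
        return all(f(record) for f in filters)
    return filter_func


records = [
    StubRecord("a", "MKV"),
    StubRecord("b", "MKVLAAGX"),
    StubRecord("c", "MKVLAAG"),
]
keep = all_of(length_between(4, 10), no_unknown_residues)
kept_ids = [r.id for r in records if keep(r)]
```

Returning a closure from `length_between` keeps the filter a plain one-argument callable, which is the shape the `filter_func` argument is described as expecting.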

Once a dataset has been loaded it can be iterated over to process data.
The `ProteinNetMap` class creates map objects that map a function over the dataset, including options to stream the map on each iteration or pre-calculate results.
They have a `generate` method that creates a generator object yielding the output of the function.
The `LabeledFunction` class is provided to create functions annotated with output types and shapes, used for automatically creating TensorFlow datasets.
The `mutation` module provides some example functions that return mutated records.
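
The difference between streaming a map and pre-calculating its results can be sketched in plain Python. The `SimpleMap` class below is a hypothetical illustration of the idea only, not the `ProteinNetMap` implementation, and its constructor arguments are assumptions chosen for this example. With `static=True` the function runs once per item up front; otherwise it runs again on every pass.

```python
class SimpleMap:
    """Illustrative map over an iterable (not the library's ProteinNetMap)."""

    def __init__(self, data, func, static=False):
        self.data = data
        self.func = func
        # static=True computes every result once and stores it;
        # static=False defers the work to each call of generate()
        self.cache = [func(item) for item in data] if static else None

    def generate(self):
        """Yield the function output for each item in the data."""
        if self.cache is not None:
            yield from self.cache
        else:
            for item in self.data:
                yield self.func(item)


calls = []


def seq_length(seq):
    """Toy map function that records each call and returns the length."""
    calls.append(seq)
    return len(seq)


sequences = ["MKV", "MKVLA"]

# Streamed: the function is re-applied on every pass over the data
streamed = SimpleMap(sequences, seq_length, static=False)
first_pass = list(streamed.generate())
second_pass = list(streamed.generate())
streamed_calls = len(calls)

# Static: the function is applied once per item, then results are reused
calls.clear()
static = SimpleMap(sequences, seq_length, static=True)
static_pass = list(static.generate()) + list(static.generate())
static_calls = len(calls)
```

Pre-calculating trades memory for speed, while streaming suits functions with random behaviour, such as mutation, where each pass should produce fresh output.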

The following example code shows a typical simple usage, creating a streamed TensorFlow dataset from ProteinNet data:

```python
from proteinnetpy import data
from proteinnetpy import tfdataset

class MapFunction(data.LabeledFunction):
    """
    Example ProteinNetMap function outputting a one-hot sequence and contact graph input data
    and multiple alignment PSSM labels
    """
    def __init__(self):
        self.output_shapes = (([None, 20], [None, None]), [None, 20])
        self.output_types = (('float32', 'float32'), 'int32')

    def __call__(self, record):
        return (record.get_one_hot_sequence().T, record.distance_matrix()), record.evolutionary.T

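# Keep only records between 32 and 2000 residues long
filter_func = data.make_length_filter(min_length=32, max_length=2000)
# Use a distinct variable name so the `data` module is not shadowed
dataset = data.ProteinNetDataset(path="path/to/proteinnet", preload=False, filter_func=filter_func)
pn_map = data.ProteinNetMap(dataset, map=MapFunction(), static=False, filter_errors=True)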

tf_dataset = tfdataset.proteinnet_tf_dataset(pn_map, batch_size=100, prefetch=400, shuffle_buffer=200)
```

Many more functions, arguments and uses are available, with detailed descriptions in the docstrings and the [documentation](https://proteinnetpy.readthedocs.io/en/latest/).

## Scripts

The package also provides convenience scripts for processing ProteinNet datasets:

* add_angles_to_proteinnet - Add extra fields to a ProteinNet file with φ, ψ and χ backbone/torsion angles
* proteinnet_to_fasta - Extract a FASTA file with the sequences from a ProteinNet file
* filter_proteinnet - Filter a ProteinNet file to include/exclude records from a list of IDs

Detailed usage instructions for each can be found using the `-h` argument.
