Metadata-Version: 2.1
Name: bio-datasets
Version: 0.0.4
Summary: Open-source collection of biology datasets and pre-trained embeddings.
Home-page: https://github.com/DeepChainBio/datasets
Author: InstaDeep
Author-email: a.delfosse@instadeep.com
License: Apache-2.0
Platform: UNKNOWN
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3.7
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: Topic :: Software Development
Requires-Python: >=3.7
Description-Content-Type: text/markdown
Requires-Dist: google-cloud-storage (==1.37.1)
Requires-Dist: pandas (==1.2.4)
Requires-Dist: tqdm (==4.60.0)

<p align="center">
  <img width="50%" src="./.source/_static/deepchain.png">
</p>


![PyPI](https://img.shields.io/pypi/v/bio-datasets)
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![Python 3.7](https://img.shields.io/badge/python-3.7-blue.svg)](https://www.python.org/downloads/release/python-360/)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
![Dependencies](https://img.shields.io/badge/dependencies-up%20to%20date-brightgreen.svg)

# Bio-datasets
Open-source collection of biology datasets and pre-trained embeddings. :dna: :closed_book:

## Description
bio-datasets is a collaborative framework that allows the user to fetch publicly available sequence-based protein datasets.
For these datasets, pre-trained contextual embeddings are also available.


## Installation
Install the required dependencies with `pip install bio-datasets`.

# How it works

```python
from biodatasets import list_datasets, load_dataset

print(list_datasets())

# Load your dataset
pathogen = load_dataset("pathogen")

# Display the available columns and embeddings
print(pathogen)

# Get data from your dataset
X, y = pathogen.to_npy_arrays(input_names=["sequence"], target_names=["class"])
embeddings = pathogen.get_embeddings("sequence", "protbert", "cls")

# Get a full description of your dataset
pathogen.display_description()
```

# How to contribute
Check out how to setup the project or **add a public dataset** in [CONTRIBUTING.md](CONTRIBUTING.md).


