Metadata-Version: 2.2
Name: pyreadstore
Version: 1.3.2
Summary: PyReadStore is the Python client (SDK) for the ReadStore API
Home-page: https://github.com/EvobyteDigitalBiology/pyreadstore
Author: Jonathan Alles
Author-email: Jonathan.Alles@evo-byte.com
License: Apache-2.0 license
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: Unix
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: requests>=2.32.3
Requires-Dist: pydantic>=2.10
Requires-Dist: pandas>=2.2
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: license
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

![GitHub Release](https://img.shields.io/github/v/release/EvobyteDigitalBiology/pyreadstore)
![PyPI - Version](https://img.shields.io/pypi/v/pyreadstore)
![Build Status](https://img.shields.io/badge/build-passing-brightgreen)


# PyReadStore SDK

This README describes PyReadStore, the Python client (SDK) for the ReadStore API. 

The full **ReadStore Basic documentation** is available [here](https://evobytedigitalbiology.github.io/readstore/) 

PyReadStore can be used to access Projects, Datasets, ProData as well as metadata and attachment files in the ReadStore Database from Python code. 
The package enables you to automate your bioinformatics pipelines, Python scripts and notebooks.

Check the [ReadStore Github repository](https://github.com/EvobyteDigitalBiology/readstore) for more information on how to get started with ReadStore and setting up your server.

More information is available on the [ReadStore website](https://evo-byte.com/readstore/).

Tutorials and Intro Videos: https://www.youtube.com/@evobytedigitalbio

Blog posts and How-Tos: https://evo-byte.com/blog/

For general questions reach out to info@evo-byte.com or in case of technical problems to support@evo-byte.com

Happy analysis :)


## Table of Contents
- [Description](#description)
- [Installation](#installation)
- [ReadStore API](#api)
- [Usage](#usage)
    1. [Quickstart](#quickstart)
    2. [Client Config](#client_config)
    3. [Datasets](#access_datasets)
    4. [Project](#access_projects)
    5. [ProData](#access_prodata)
    6. [Download](#download_attach)
    7. [Upload FASTQ](#upload_fastq)
- [Contributing](#contributing)
- [License](#license)
- [Credits and Acknowledgments](#acknowledgments)

## The Lean Solution for Managing NGS and Omics Data

ReadStore is a platform for storing, managing, and integrating omics data. It speeds up analysis and offers a simple way of managing and sharing NGS omics datasets, metadata and processed data (**Pro**cessed **Data**).
Built-in project and metadata management structures your workflows and a collaborative user interface enhances teamwork — so you can focus on generating insights.

The integrated Webservice (API) enables you to directly retrieve data from ReadStore via the terminal [Command-Line-Interface (CLI)](https://github.com/EvobyteDigitalBiology/readstore-cli) or [Python](https://github.com/EvobyteDigitalBiology/pyreadstore) / [R](https://github.com/EvobyteDigitalBiology/r-readstore) SDKs.

The ReadStore Basic version provides a local webserver with simple user management. If you need an organization-wide deployment, advanced user and group management, or cloud integration, please check the ReadStore Advanced versions and reach out to info@evo-byte.com.

## Description

PyReadStore is a Python client (SDK) that lets you easily connect to your ReadStore server and interact with the ReadStore API.
By importing the pyreadstore package in Python, you can quickly retrieve data from a ReadStore server.

This tool provides streamlined and standardized access to NGS datasets and metadata, helping you run analyses more efficiently and with fewer errors.
You can easily scale your pipelines, and if you need to migrate or move NGS data, updating the ReadStore database ensures all your workflows stay up-to-date.


## Security and Permissions<a id="backup"></a>

**PLEASE READ AND FOLLOW THESE INSTRUCTIONS CAREFULLY!**

### User Accounts and Token<a id="token"></a>

Using PyReadStore requires an active user account and a token (and a running ReadStore server). 

You should **never enter your user account password** when working with PyReadStore.

To retrieve your token:

1. Login to the ReadStore app via your browser
2. Navigate to the `Settings` page and click on `Token`
3. You can regenerate your token anytime (`Reset`). This will invalidate the previous token

For uploading FASTQ files your user account needs to have `Staging Permission`.
You can check this in the `Settings` page of your account.
If you do not have `Staging Permission`, ask your ReadStore server admin to grant you permission.

### Setting Your Credentials

You need to provide the PyReadStore client with valid ReadStore credentials.

There are three options:

1. Load credentials from the ReadStore `config` file. 
The file is generated by the [ReadStore CLI](https://github.com/EvobyteDigitalBiology/readstore-cli),
by default in your home directory (`~/.readstore/`). Make sure to keep read permissions to the file restrictive.

2. Directly enter your username and token when instantiating a PyReadStore client within your Python code

3. Set username and token via environment variables (`READSTORE_USERNAME`, `READSTORE_TOKEN`). This is useful in container or cloud environments.
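
The third option can be sketched as follows. This is a minimal sketch, assuming `READSTORE_USERNAME` and `READSTORE_TOKEN` are exported in the environment. The helper `credentials_from_env` is hypothetical and not part of PyReadStore; it only fails early with a clear message if a variable is missing and returns the values in a form that matches the client's `username` and `token` arguments.

```python
import os


def credentials_from_env(env=None):
    """Return ReadStore credentials found in the environment.

    Hypothetical helper (not part of PyReadStore): raises a clear error
    if one of the two variables is missing or empty.
    """
    env = os.environ if env is None else env
    required = ("READSTORE_USERNAME", "READSTORE_TOKEN")
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {"username": env["READSTORE_USERNAME"],
            "token": env["READSTORE_TOKEN"]}


# Usage (requires pyreadstore and a running ReadStore server):
# import pyreadstore
# rs_client = pyreadstore.Client(**credentials_from_env())
```

Keeping the token in the environment avoids hard-coding it in scripts or notebooks that may be shared.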


## Installation

`pip3 install pyreadstore`

You can perform the install in a conda or venv virtual environment to simplify package management.

A user-level install is also possible:

`pip3 install --user pyreadstore`
 

```python 
import pyreadstore
```

## ReadStore API<a id="api"></a>

The **ReadStore Basic** server provides a RESTful API for accessing resources via HTTP requests.  
This API extends the functionalities of the ReadStore CLI as well as the Python and R SDKs.

### API Endpoint
By default, the API is accessible at:  
`http://127.0.0.1:8000/api_x_v1/`

### Authentication
Users must authenticate using their username and token via the Basic Authentication scheme.

### Example Usage
Below is an example demonstrating how to use `curl` to retrieve an overview of Projects by sending an HTTP `GET` request to the `project/` endpoint.  
In this example, the username is `testuser`, and the token is `0dM9qSU0Q5PLVgDrZRftzw`. You can find your token in the ReadStore settings.

```bash
curl -X GET -u testuser:0dM9qSU0Q5PLVgDrZRftzw http://localhost:8000/api_x_v1/project/
```
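
The same request can be sketched in Python using only the standard library. This is an illustration, not how PyReadStore is implemented; the function names `build_project_request` and `fetch_projects` are hypothetical. The first function only builds the request with a Basic Authentication header, the second one sends it and therefore needs a running ReadStore server.

```python
import base64
import json
import urllib.request

API_BASE = "http://localhost:8000/api_x_v1/"  # default endpoint from this README


def build_project_request(username, token, base_url=API_BASE):
    """Build (but do not send) a GET request for the project/ endpoint."""
    raw = f"{username}:{token}".encode("utf-8")
    auth = "Basic " + base64.b64encode(raw).decode("ascii")
    return urllib.request.Request(base_url + "project/",
                                  headers={"Authorization": auth},
                                  method="GET")


def fetch_projects(username, token):
    """Send the request and decode the JSON body (needs a running server)."""
    with urllib.request.urlopen(build_project_request(username, token)) as resp:
        return json.load(resp)
```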

#### Example Response

A successful HTTP response returns a JSON-formatted string describing the project(s) in the ReadStore database. Example response:

```
[{
  "id": 4,
  "name": "TestProject99",
  "metadata": {
    "key1": "value1",
    "key2": "value2"
  },
  "attachments": []
}]
```
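
A response of this shape can be processed with Python's `json` module. The snippet below is a small sketch that parses the example response shown above; the variable names are only illustrative.

```python
import json

# Example response body, copied from the example above
response_text = """
[{
  "id": 4,
  "name": "TestProject99",
  "metadata": {"key1": "value1", "key2": "value2"},
  "attachments": []
}]
"""

projects = json.loads(response_text)                  # list of project dicts
names_by_id = {p["id"]: p["name"] for p in projects}  # map project id -> name
metadata_keys = sorted(projects[0]["metadata"])       # metadata keys of first project

print(names_by_id)
print(metadata_keys)
```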

### Documentation

Comprehensive [API documentation](https://evobytedigitalbiology.github.io/readstore/rest_api/) is available in the ReadStore Basic Docs.


## Usage

Detailed tutorials, videos and explanations are found on [YouTube](https://www.youtube.com/playlist?list=PLk-WMGySW9ySUfZU25NyA5YgzmHQ7yquv) or on the [**EVO**BYTE blog](https://evo-byte.com/blog).

### Quickstart<a id="quickstart"></a>

Let's access some dataset and project data from the ReadStore database!

Make sure a ReadStore server is running and reachable (by default under `127.0.0.1:8000`).
You can enter (`http://127.0.0.1:8000/api_v1/`) in your browser and should get a response from the API.

We assume you ran `readstore configure` before to create a config file for your user.
If not, consult the [ReadStore CLI](https://github.com/EvobyteDigitalBiology/readstore-cli) README on how to set this up.

We will create a client instance and perform some operations to retrieve data from the ReadStore database.
More information on all available methods can be found below.


```python 
import pyreadstore

rs_client = pyreadstore.Client() # Create an instance of the ReadStore client

# Manage Datasets

datasets = rs_client.list()      # List all datasets and return pandas dataframe

datasets_project_1 = rs_client.list(project_id = 1) # List all datasets for project 1

datasets_id_25 = rs_client.get(dataset_id = 25)     # Get detailed data for dataset 25

# Manage Projects

projects = rs_client.list_projects()                # List all projects

project = rs_client.get_project(project_name = 'MyProject') # Get details for MyProject

# Access FASTQ data and attachments

fastq_data_id_25 = rs_client.get_fastq(dataset_id = 25)     # Get fastq file data for dataset 25

rs_client.download_attachment(dataset_id = 25,              # Download files attached to dataset 25
                              attachment_name = 'gene_table.tsv') 

# Manage Processed Data

rs_client.upload_pro_data(name = 'sample_1_count_matrix',      # Set name of count matrix
                            pro_data_file = 'path/to/sample_1_counts.h5',   # Set file path
                            data_type = 'count_matrix',                     # Set type to 'count_matrix'
                            dataset_id = 25)                                # Set dataset id for upload

pro_data_project_1 = rs_client.list_pro_data(project_id = 1) # Get all ProData entries for Project 1

pro_data = rs_client.get_pro_data(name = 'sample_1_count_matrix',   # Set name to sample_1_count_matrix
                                dataset_id = 25)                    # dataset_id

pro_data_id = rs_client.delete_pro_data(name = 'sample_1_count_matrix',
                                        dataset_id = 25)

# Ingest FASTQ files

rs_client.upload_fastq(fastq = ['path/to_fastq_r1.fq', 'path/to_fastq_r2.fq'], # Upload FASTQ files
                        fastq_name = ['sample_rep_1_r1', 'sample_rep_1_r2'],    # Set FASTQ names
                        read_type = ['R1', 'R2'])                               # Set individual FASTQ read types
```


### Configure the Python Client<a id="client_config"></a>

The Client is the central object and provides authentication against the ReadStore API.
By default, the client will try to read the `~/.readstore/config` credentials file.
You can change the directory if your config file is located in another folder.

If you set the `username` and `token` arguments, the client will use these credentials instead.

If your ReadStore server is not running under localhost (`127.0.0.1`) port `8000`, you can adapt the default settings.

```python 
pyreadstore.Client(config_dir: str = '~/.readstore',  # Directory containing ReadStore credentials
                  username: str | None = None,        # Username
                  token : str | None = None,          # Token
                  host: str = 'http://localhost',     # Hostname / IP of ReadStore server
                  return_type: str = 'pandas',        # Default return types, can be pandas or json
                  port: int = 8000,                   # Server Port Number
                  fastq_extensions: List[str] = ['.fastq','.fastq.gz','.fq','.fq.gz']) 
                  # Accepted FASTQ file extensions for upload validation 
```

It is possible to set the username, token, server endpoint and FASTQ extensions using the environment variables listed below.
Environment variables take precedence over other client configuration.

- `READSTORE_USERNAME` (username)
- `READSTORE_TOKEN` (token)
- `READSTORE_ENDPOINT_URL` (`http://host:port`, e.g. `http://localhost:8000`)
- `READSTORE_FASTQ_EXTENSIONS` (fastq_extensions, e.g. `'.fastq,.fastq.gz,.fq,.fq.gz'`)
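
The sketch below shows how such values map onto the `host`, `port` and `fastq_extensions` client arguments. It is an illustration under assumptions, not the client's actual parsing code: the helper names are hypothetical, and the extensions variable is assumed to hold a comma-separated list.

```python
import os
from urllib.parse import urlsplit


def endpoint_from_env(env=None, default="http://localhost:8000"):
    """Split READSTORE_ENDPOINT_URL into (host, port) as used by the client."""
    env = os.environ if env is None else env
    parts = urlsplit(env.get("READSTORE_ENDPOINT_URL", default))
    host = f"{parts.scheme}://{parts.hostname}"
    return host, parts.port


def fastq_extensions_from_env(env=None):
    """Parse READSTORE_FASTQ_EXTENSIONS, assuming a comma-separated value."""
    env = os.environ if env is None else env
    raw = env.get("READSTORE_FASTQ_EXTENSIONS", "")
    # Drop surrounding whitespace and quotes from each entry, skip empty ones
    items = (item.strip(" '\"") for item in raw.split(","))
    return [item for item in items if item]
```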

**Possible errors**

- Connection Error: no ReadStore server was found at the provided endpoint
- Authentication Error: the provided username or token was not found
- No Permission to Upload/Delete FASTQ/ProData: the user has no `Staging Permission`

### Access Datasets<a id="access_datasets"></a>

```python 
# List ReadStore Datasets

rs_client.list(project_id: int | None = None,   # Filter datasets for project with id `project_id`
              project_name: str | None = None,  # Filter datasets for project with name `project_name`
               return_type: str | None = None   # Return pd.DataFrame or JSON type
               ) -> pd.DataFrame | List[dict]

# Get ReadStore Dataset Details
# Provide dataset_id OR dataset_name

rs_client.get(dataset_id: int| None = None,     # Get dataset with id `dataset_id`
              dataset_name: str | None = None,  # Get dataset with name `dataset_name`
              return_type: str | None = None    # Return pd.Series or json(dict)
              ) -> pd.Series | dict

# Get FASTQ file data for a dataset
# Provide dataset_id OR dataset_name

rs_client.get_fastq(dataset_id: int| None = None,    # Get fastq data for dataset with id `dataset_id`
                  dataset_name: str | None = None,   # Get fastq data for dataset `dataset_name`
                  return_type: str | None = None     # Return pd.DataFrame or JSON (list of dicts)
                  ) -> pd.DataFrame | List[dict]

# Return metadata for datasets in a dedicated pandas dataframe
# Metadata keys are pivoted as columns, and values as rows

rs_client.list_metadata(project_id: int | None = None,   # Subset by project_id
                        project_name: str | None = None  # Subset by project_name
                        ) -> pd.DataFrame

```
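
To illustrate what "pivoted" means for `list_metadata`, the following plain-Python sketch turns per-dataset metadata dictionaries into one column per key and one row per dataset. The sample records and their fields are made up for illustration, and the exact columns returned by the client may differ.

```python
# Hypothetical dataset records with free-form metadata
datasets = [
    {"id": 1, "name": "sample_1", "metadata": {"tissue": "liver", "replicate": "1"}},
    {"id": 2, "name": "sample_2", "metadata": {"tissue": "brain"}},
]

# Collect every metadata key once, preserving first-seen order
columns = []
for ds in datasets:
    for key in ds["metadata"]:
        if key not in columns:
            columns.append(key)

# One row per dataset; keys a dataset lacks are filled with None
rows = [[ds["metadata"].get(col) for col in columns] for ds in datasets]

print(columns)
for ds, row in zip(datasets, rows):
    print(ds["name"], row)
```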

### Edit Datasets

**NOTE** Editing methods such as create or delete require `Staging Permission` authorization.

When creating datasets, the `dataset_name` argument and the keys of the `metadata` dictionary are checked for consistency: each must be non-empty and contain only alphanumeric characters (plus `_`, `-`, `.`, `@`). Metadata keys must not be reserved keywords (listed below).

```python 
# Create an empty Dataset, without FASTQ files attached

# Name must be unique in Database
# Optionally define Project IDs and/or Project names to attach Dataset to.  

rs_client.create(dataset_name: str,                       # Set name
                 description: str = '',           # Set description. Defaults to ''.
                 project_ids: List[int] = [],     # Set project_ids. Defaults to [].
                 project_names: List[str] = [],   # Set project_names. Defaults to [].
                 metadata: dict = {})              # Set metadata. Defaults to {}.

# Update a Dataset
# Dataset_id must be provided to define the dataset to update.
# Only arguments where a new value is specified will be updated.
# Arguments with a None value remain unaltered.

rs_client.update(dataset_id: int,                 # Set ID to update
                dataset_name: str | None = None,  # Updated name (optional)
                description: str | None = None,   # Updated description (optional)
                project_ids: List[int] | None = None,   # Updated project_ids (optional)
                project_names: List[str] | None = None, # Updated project_names (optional)
                metadata: dict | None = None)           # Updated metadata (optional)

# Provide empty project_ids or project_names list [] to unset all associated projects

# Delete Dataset (and attached FASTQ files)
# Either dataset_id or dataset_name argument must be provided

rs_client.delete(dataset_id: int | None = None,   # Delete by ID. Defaults to None.
                 dataset_name: str | None = None)  # Delete by Name. Defaults to None.
```
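
The update semantics described in the comments above (a `None` value leaves a field unaltered, an empty list unsets all associated projects) can be illustrated with plain Python. `build_update_payload` is a hypothetical helper for illustration only and does not reflect how the client is implemented.

```python
def build_update_payload(**changes):
    """Keep only arguments that were given a new value (anything but None)."""
    return {key: value for key, value in changes.items() if value is not None}


payload = build_update_payload(
    dataset_name=None,                   # unchanged: dropped from the payload
    description="Re-sequenced sample",   # updated
    project_ids=[],                      # empty list is kept: unset all projects
    metadata=None,                       # unchanged: dropped from the payload
)
print(payload)
```

Note the explicit `is not None` test: a plain truthiness check would also drop the empty list and make it impossible to unset projects.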


### Access Projects<a id="access_projects"></a>

```python 
# List ReadStore Projects

rs_client.list_projects(return_type: str | None = None   # Return pd.DataFrame or JSON type
                        ) -> pd.DataFrame | List[dict]

# Get ReadStore Project Details
# Provide project_id OR project_name

rs_client.get_project(project_id: int| None = None,     # Get project with id `project_id`
                      project_name: str | None = None,  # Get project with name `project_name`
                      return_type: str | None = None    # Return pd.Series or json(dict)
                      ) -> pd.Series | dict

# Return metadata for projects in a dedicated pandas dataframe
# Metadata keys are pivoted as columns, and values as rows

rs_client.list_projects_metadata() -> pd.DataFrame

```

### Edit Projects

**NOTE** Editing methods such as create or delete require `Staging Permission` authorization.

When creating projects, the `project_name` argument and the keys of the `metadata` dictionary are checked for consistency: each must be non-empty and contain only alphanumeric characters (plus `_`, `-`, `.`, `@`). Metadata keys must not be reserved keywords (listed below).

```python 
# Create ReadStore Project

# name must be unique in Database
# dataset_metadata_keys can be attached and will be set as default metadata keys for attached datasets

rs_client.create_project(project_name: str,                       # Set Project name
                         description: str = '',           # Set Project description. Defaults to ''.
                         metadata: dict = {},             # Set Project metadata. Defaults to {}.
                         dataset_metadata_keys: List[str] = [])  # Set dataset metadata keys. Defaults to [].

# Update a Project
# Project_id must be provided to define the project to update.
# Only arguments where a new value is specified will be updated.
# Arguments with a None value remain unaltered.

rs_client.update_project(project_id: int,                # Set project id to update
                         project_name: str | None = None, # Updated name (optional)
                         description: str | None = None,  # Updated description (optional)
                         metadata: dict | None = None,    # Updated metadata (optional)
                         dataset_metadata_keys: List[str] | None = None) # Updated metadata keys (optional)

# Delete ReadStore Project
# Either project_id or project_name argument must be provided

rs_client.delete_project(project_id: int | None = None,    # Delete by ID. Defaults to None.
                         project_name: str | None = None)  # Delete by Name. Defaults to None.
```

### Access **Pro**cessed **Data**<a id="access_prodata"></a>

```python 
# Upload Processed Data

rs_client.upload_pro_data(name: str,                # Name of ProData
                        pro_data_file: str,         # Set ProData file path
                        data_type: str,             # Set ProData data type
                        description: str = '',      # Description for ProData
                        metadata: dict = {},        # MetaData
                        dataset_id: int | None = None,  # Dataset ID to assign ProData to
                        dataset_name: str | None = None)# Dataset Name to assign ProData to

# Must provide dataset_id or dataset_name

# List and filter Processed Data

rs_client.list_pro_data(project_id: int | None = None,      # Filter by Project ID
                        project_name: str | None = None,    # Filter by Project Name
                        dataset_id: int | None = None,      # Filter by Dataset ID
                        dataset_name: str | None = None,    # Filter by Dataset Name
                        name: str | None = None,            # Filter by ProData name
                        data_type: str | None = None,       # Filter by ProData data type
                        include_archived: bool = False,     # Include archived
                        return_type: str | None = None) -> pd.DataFrame | List[dict]

# Get individual ProData entry

rs_client.get_pro_data(pro_data_id: int | None = None,  # Get ProData by ID
                        dataset_id: int | None = None,  # Get ProData by Dataset ID
                        dataset_name: str | None = None, # Get ProData by Dataset Name
                        name: str | None = None,        # Get ProData by name
                        version: int | None = None,     # Get specific version, None returns latest valid version
                        return_type: str | None = None) -> pd.Series | dict

# Provide ID or Name + Dataset ID/Name

# Get metadata from ProData entries

rs_client.list_pro_data_metadata(project_id: int | None = None, # Subset by project ID
                                project_name: str | None = None, # Subset by project name
                                dataset_id: int | None = None,   # Subset by Dataset ID
                                dataset_name: str | None = None, # Subset by Dataset Name
                                name: str | None = None,         # Subset by ProData Name
                                data_type: str | None = None,    # Subset by ProData Type
                                include_archived: bool = False  # Include Archived entries
                                ) -> pd.DataFrame

# Delete ProData entry

rs_client.delete_pro_data(pro_data_id: int | None = None,   # Delete by ProData ID
                        dataset_id: int | None = None,      # Delete by Dataset ID
                        dataset_name: str | None = None,    # Delete by Dataset Name
                        name: str | None = None,            # Delete by name
                        version: int | None = None)         # Delete specific version

# Provide ID or Name + Dataset ID/Name for delete
```

### Download Attachments<a id="download_attach"></a>

```python 
# Download project attachment file from ReadStore Database 

rs_client.download_project_attachment(attachment_name: str,            # name of attachment file
                                      project_id: int | None = None,   # project id with attachment
                                      project_name: str | None = None, # project name with attachment
                                      outpath: str | None = None)      # Path to download file to

# Download dataset attachment file from ReadStore Database 

rs_client.download_attachment(attachment_name: str,             # name of attachment file
                              dataset_id: int | None = None,    # dataset id with attachment
                              dataset_name: str | None = None,  # dataset name with attachment
                              outpath: str | None = None)       # Path to download file to
```

### Upload FASTQ files<a id="upload_fastq"></a>

Upload FASTQ files to the ReadStore server. The method checks that the FASTQ files exist and end with a valid FASTQ extension.

```python 
# Upload FASTQ files to ReadStore 

rs_client.upload_fastq(fastq : List[str] | str)  # Path of FASTQ files to upload
```
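
The kind of check described above can be sketched as a local pre-flight test before calling `upload_fastq`. This is a minimal sketch and not the client's own validation code; `check_fastq_paths` is a hypothetical helper, and the extensions are the client defaults listed in the configuration section.

```python
import os

# Default extensions accepted by the client (see client configuration)
FASTQ_EXTENSIONS = ('.fastq', '.fastq.gz', '.fq', '.fq.gz')


def check_fastq_paths(paths, extensions=FASTQ_EXTENSIONS):
    """Return a list of problems found; an empty list means all paths look fine."""
    if isinstance(paths, str):  # accept a single path as well as a list
        paths = [paths]
    problems = []
    for path in paths:
        if not path.endswith(tuple(extensions)):
            problems.append(f"{path}: not a valid FASTQ extension")
        elif not os.path.isfile(path):
            problems.append(f"{path}: file not found")
    return problems
```

Running such a check first gives a complete list of problems for a batch of files instead of stopping at the first failed upload.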

#### Reserved keywords

The following keywords must not be used as metadata keys:

```
'id','name','project','project_ids','project_names','owner_group_name','qc_passed','paired_end',
'index_read','created','description','owner_username','fq_file_r1','fq_file_r2','fq_file_i1',
'fq_file_i2','id_project','name_project','name_og','archived','collaborators','dataset_metadata_keys',
'data_type','version','valid_to','upload_path','owner_username','fq_dataset','id_fq_dataset','name_fq_dataset'
```
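
Metadata keys can be checked against these rules before they are sent to the server. The sketch below is an illustration only: `invalid_metadata_keys` is a hypothetical helper, and "alphanumeric" is assumed here to mean ASCII letters and digits.

```python
import re

# Reserved keywords copied from the list above
RESERVED_KEYWORDS = {
    'id', 'name', 'project', 'project_ids', 'project_names', 'owner_group_name',
    'qc_passed', 'paired_end', 'index_read', 'created', 'description',
    'owner_username', 'fq_file_r1', 'fq_file_r2', 'fq_file_i1', 'fq_file_i2',
    'id_project', 'name_project', 'name_og', 'archived', 'collaborators',
    'dataset_metadata_keys', 'data_type', 'version', 'valid_to', 'upload_path',
    'fq_dataset', 'id_fq_dataset', 'name_fq_dataset',
}

# Non-empty, ASCII alphanumeric characters plus _ - . @ (assumed interpretation)
ALLOWED_KEY = re.compile(r"[A-Za-z0-9_\-.@]+")


def invalid_metadata_keys(metadata):
    """Return the keys that are reserved, empty, or contain other characters."""
    return [key for key in metadata
            if key in RESERVED_KEYWORDS or not ALLOWED_KEY.fullmatch(key)]


print(invalid_metadata_keys({"tissue": "liver", "name": "x", "bad key": "y"}))
```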

## Contributing

Contributions make this project better! Whether you want to report a bug, improve documentation, or add new features, any help is welcomed!

### How You Can Help
- Report Bugs
- Suggest Features
- Improve Documentation
- Code Contributions

### Contribution Workflow
1. Fork the repository and create a new branch for each contribution.
2. Write clear, concise commit messages.
3. Submit a pull request and wait for review.

Thank you for helping make this project better!

## License

pyreadstore is licensed under the Apache 2.0 open source license.
See the LICENSE file for more information.

## Credits and Acknowledgments<a id="acknowledgments"></a>

pyreadstore is built upon the following open-source Python packages. We would like to thank all contributing authors, developers and partners.

- Python (https://www.python.org/)
- requests (https://requests.readthedocs.io/en/latest/)
- pydantic (https://docs.pydantic.dev/latest/)
- pandas (https://pandas.pydata.org/)
