Metadata-Version: 2.1
Name: protein-information-system
Version: 3.1.2
Summary: Comprehensive Python Module for Protein Data Management: Designed for streamlined integration and processing of protein information from both UniProt and PDB. Equipped with features for concurrent data fetching, robust error handling, and database synchronization.
Author: frapercan
Author-email: frapercan1@alum.us.es
Requires-Python: >=3.10,<4.0
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Requires-Dist: bio (>=1.8.0,<2.0.0)
Requires-Dist: esm (>=3.2.1,<4.0.0)
Requires-Dist: gemmi (>=0.7.3,<0.8.0)
Requires-Dist: h5py (>=3.12.1,<4.0.0)
Requires-Dist: mini3di (>=0.2.1,<0.3.0)
Requires-Dist: pandas (>=2.3.1,<3.0.0)
Requires-Dist: pgvector (>=0.4,<0.5)
Requires-Dist: pika (>=1.3.2,<2.0.0)
Requires-Dist: psycopg2-binary (>=2.9.9,<3.0.0)
Requires-Dist: pyyaml (>=6.0.1,<7.0.0)
Requires-Dist: retry (>=0.9.2,<0.10.0)
Requires-Dist: sentencepiece (>=0.2.0,<0.3.0)
Requires-Dist: sqlalchemy (>=2.0.40,<3.0.0)
Requires-Dist: tokenizer (>=3.4.3,<4.0.0)
Requires-Dist: torch (>=2.3.0,<3.0.0)
Requires-Dist: transformers (>=4.48.1,<5.0.0)
Description-Content-Type: text/markdown

[![PyPI - Version](https://img.shields.io/pypi/v/protein-information-system)](https://pypi.org/project/protein-information-system/)
[![Documentation Status](https://readthedocs.org/projects/protein-information-system/badge/?version=latest)](https://protein-information-system.readthedocs.io/en/latest/?badge=latest)
![Linting Status](https://github.com/CBBIO/protein-information-system/actions/workflows/test-lint.yml/badge.svg?branch=main)
[![codecov](https://codecov.io/gh/CBBIO/protein-information-system/branch/main/graph/badge.svg)](https://codecov.io/gh/CBBIO/protein-information-system)

# **Protein Information System (PIS)**

**Protein Information System (PIS)** is an integrated biological information system focused on extracting, processing, and managing protein-related data. PIS consolidates data from **UniProt**, **PDB**, and **GOA**, enabling the efficient retrieval and organization of protein sequences, structures, and functional annotations.

The primary goal of PIS is to provide a robust framework for large-scale protein data extraction, facilitating downstream functional analysis and annotation transfer. The system is designed for **high-performance computing (HPC) environments**, ensuring scalability and efficiency.


## 📈 **Current State of the Project**

### **FANTASIA: Functional Annotation Toolkit**


> 🧠 **FANTASIA** was built on top of the Protein Information System (PIS) as an advanced tool for **functional protein annotation** using embeddings generated by protein language models.
>
> [🔗 FANTASIA Repository](https://github.com/CBBIO/FANTASIA)
>
> The pipeline supports high-performance computing (HPC) environments and integrates tools such as ProtT5, ESM, and CD-HIT. These models can be extended or replaced with new variants **without modifying the core software structure**, simply by adding the new model to the PIS. This design enables scalable, modular, and reproducible GO term annotation from FASTA sequence files.


### **Protocol for Large-Scale Metamorphism and Multifunctionality Search**

> 🔍 In addition, a systematic protocol has been developed for the **large-scale identification of structural metamorphisms** and **protein multifunctionality**.
>
> [🔗 Metamorphic and Multifunctionality Search Repository](https://github.com/CBBIO/metamorphic_multifunctional_search)
> 
> This protocol leverages the full capabilities of PIS to uncover non-obvious relationships between structure and function. **Structural metamorphisms** are detected by filtering large-scale structural alignments between proteins with high sequence identity, identifying divergent conformations. **Multifunctionality** is addressed through a semantic analysis of GO annotations, computing a functional distance metric to determine the two most divergent terms within each GO category per protein.

---

## **📡 Installing the BioData Lookup Table (Two Options)**

This guide shows two ways to load and use the **BioData** lookup table:

1. **Option A** - Manually download the PostgreSQL backup from **Zenodo** and restore it yourself (no PIS required).
2. **Option B** - Clone the **Protein Information System (PIS)** repository and let its helper script set everything up.

Both options end with the same result: a PostgreSQL database called `BioData` running with the `pgvector` extension enabled.

---

## 📚 Prerequisites

- A machine with:
    - Docker installed and running.
    - At least ~25-30 GB of free disk space (the backup itself is large).
- PostgreSQL client tools installed on your host:
    - `psql`, `createdb`, `dropdb`, `pg_restore`
    - Recommended: PostgreSQL 16+ client tools.
- Credentials used in this guide:
    - PostgreSQL user: `usuario`
    - PostgreSQL password: `clave`
    - Database name: `BioData`

> If you use different credentials, substitute them in every command below.

---

## Option A - Manual Setup from Zenodo (without PIS)

### 1. Start the pgvector PostgreSQL container

```bash
docker run -d --name pgvectorsql \
    -e POSTGRES_USER=usuario \
    -e POSTGRES_PASSWORD=clave \
    -e POSTGRES_DB=BioData \
    -p 5432:5432 \
    pgvector/pgvector:pg16
```

This starts PostgreSQL with pgvector on `localhost:5432`.

---

### 2. Download the BioData backup from Zenodo

1. Open the Zenodo record in your browser, for example:
    - Final-layer table: `https://zenodo.org/records/17795871`
    - Early+final layers table: `https://zenodo.org/records/17793273`
2. In the **Files** section, locate the `.backup` file you want, e.g.:
    - `BioData_Dec25_esm2_prott5_prostt5_ankh3_large_esm3c_Layer0.backup`, or
    - `BioData_Dec25_esm2_prott5_prostt5_ankh3_large_esm3c_Layers_3Frist_3Last.backup`
3. Click **Download** and **save the file** to a known location, for example:

```
~/biodata_backups/BioData_Dec25_esm2_prott5_prostt5_ankh3_large_esm3c_Layers_3Frist_3Last.backup
```
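
A download that went wrong typically leaves a file of only a few KB instead of several GB. As a quick sanity check, a minimal Python sketch such as the following can flag that case before you attempt a restore. The function name `looks_like_full_backup` and the 1 GiB threshold are illustrative assumptions, not part of PIS; pick a threshold that suits the file you downloaded.

```python
from pathlib import Path

# Assumed threshold: the guide only says "multi-GB", so 1 GiB is an illustrative floor.
MIN_BACKUP_BYTES = 1 * 1024**3


def looks_like_full_backup(path: str, min_bytes: int = MIN_BACKUP_BYTES) -> bool:
    """Return True if `path` is an existing file of at least `min_bytes` bytes."""
    candidate = Path(path).expanduser()
    return candidate.is_file() and candidate.stat().st_size >= min_bytes


if __name__ == "__main__":
    backup = (
        "~/biodata_backups/"
        "BioData_Dec25_esm2_prott5_prostt5_ankh3_large_esm3c_Layers_3Frist_3Last.backup"
    )
    if looks_like_full_backup(backup):
        print("Backup size looks plausible.")
    else:
        print("Backup is missing or too small; download it again from Zenodo.")
```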

> Do this via the browser to avoid Zenodo's cookie/redirect issues. The file should be multi-GB in size, not a few KB.

---

### 3. Drop and recreate the BioData database

On your host, using the PostgreSQL client tools (connecting to the Docker container):

```bash
export PGPASSWORD="clave"

# 1) Try to drop the database if it exists
dropdb -h localhost -U usuario BioData --if-exists

# 2) If there are still active connections, terminate them
psql -h localhost -U usuario -d postgres -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname = 'BioData' AND pid <> pg_backend_pid();"

dropdb -h localhost -U usuario BioData --if-exists

# 3) Final termination attempt (if needed) and drop
psql -h localhost -U usuario -d postgres \
    -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname = 'BioData';"

sleep 2
dropdb -h localhost -U usuario BioData --if-exists

# 4) Recreate BioData
createdb -h localhost -U usuario BioData
```

---

### 4. Enable pgvector extension

```bash
psql -h localhost -U usuario -d BioData \
    -c "CREATE EXTENSION IF NOT EXISTS vector;"
```

---

### 5. Restore the BioData backup

```bash
export PGPASSWORD="clave"

pg_restore -h localhost -U usuario \
    -d BioData \
    ~/biodata_backups/BioData_Dec25_esm2_prott5_prostt5_ankh3_large_esm3c_Layers_3Frist_3Last.backup
```

If the restore succeeds, the `BioData` database is ready to use.
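
Steps 3 to 5 can also be scripted. The following Python sketch is one possible way to do that and is not part of PIS: it assembles the same `psql`, `dropdb`, `createdb`, and `pg_restore` invocations used above and runs them in order. The function names `rebase_commands` and `run_rebase` are hypothetical, and the defaults simply mirror the credentials used in this guide.

```python
import os
import subprocess


def rebase_commands(backup_path, database="BioData", user="usuario", host="localhost"):
    """Return the argv lists for the terminate, drop, create, extension, and restore steps."""
    conn = ["-h", host, "-U", user]
    # The database name is interpolated as-is; only pass trusted values here.
    terminate = (
        "SELECT pg_terminate_backend(pid) FROM pg_stat_activity "
        f"WHERE datname = '{database}' AND pid <> pg_backend_pid();"
    )
    return [
        ["psql", *conn, "-d", "postgres", "-c", terminate],
        ["dropdb", *conn, "--if-exists", database],
        ["createdb", *conn, database],
        ["psql", *conn, "-d", database, "-c", "CREATE EXTENSION IF NOT EXISTS vector;"],
        ["pg_restore", *conn, "-d", database, os.path.expanduser(backup_path)],
    ]


def run_rebase(backup_path, password, **kwargs):
    """Run each step in order; check=True stops at the first non-zero exit code."""
    env = {**os.environ, "PGPASSWORD": password}
    for argv in rebase_commands(backup_path, **kwargs):
        subprocess.run(argv, check=True, env=env)
```

Calling `run_rebase("~/biodata_backups/<file>.backup", password="clave")` would then execute the five commands against `localhost`. Building the argument lists separately from running them makes it easy to print and review the commands before anything is dropped.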

---

### 6. Connecting to BioData

- Using `psql`:

```bash
PGPASSWORD="clave" psql -h localhost -U usuario -d BioData
```

- Typical connection URL for applications:

```
postgresql://usuario:clave@localhost:5432/BioData
```

Use this string in any tool, notebook, or pipeline that needs to query the lookup table.
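
If your credentials differ from the ones in this guide, the URL has to be rebuilt, and characters such as `@` or `/` in a password must be percent-encoded. The helper below is a small illustrative sketch using only the standard library; the name `biodata_url` is an assumption and does not exist in PIS.

```python
from urllib.parse import quote


def biodata_url(user: str, password: str, host: str = "localhost",
                port: int = 5432, database: str = "BioData") -> str:
    """Build a PostgreSQL connection URL, percent-encoding user, password, and database."""
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{quote(database, safe='')}"
    )


print(biodata_url("usuario", "clave"))
# prints: postgresql://usuario:clave@localhost:5432/BioData
```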

---

## Option B - Using the PIS Repository and Helper Script

If you also want the **Protein Information System** (PIS) and its automation around the database, use this method.

### 1. Clone the repository

```bash
cd /path/where/you/want/the/repo
git clone https://github.com/CBBIO/protein-information-system.git
cd protein-information-system
```

---

### 2. Set the Zenodo URL in `pis_launcher_script.sh`

At the top of `pis_launcher_script.sh`, set:

```bash
ZENODO_URL="https://zenodo.org/records/17793273/files/BioData_Dec25_esm2_prott5_prostt5_ankh3_large_esm3c_Layers_3Frist_3Last.backup?download=1"
```

(Or use the URL of the specific `.backup` file you want from the **Files** section.)

The script will:

- Derive the filename from this URL.
- Download to the configured backup folder if it does not exist.
- Reuse the local file on subsequent runs (no re-download).
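
To make that behaviour concrete, here is a Python sketch of the same idea. It is only an illustration of the logic described above, not the shell code the launcher actually runs, and the names `backup_filename` and `needs_download` are hypothetical.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import unquote, urlsplit


def backup_filename(url: str) -> str:
    """Return the last path segment of `url`, ignoring the query string."""
    return PurePosixPath(unquote(urlsplit(url).path)).name


def needs_download(url: str, backup_folder: str) -> bool:
    """Return True when the file named by `url` is not yet present in `backup_folder`."""
    return not (Path(backup_folder).expanduser() / backup_filename(url)).is_file()


url = (
    "https://zenodo.org/records/17793273/files/"
    "BioData_Dec25_esm2_prott5_prostt5_ankh3_large_esm3c_Layers_3Frist_3Last.backup"
    "?download=1"
)
print(backup_filename(url))
# prints: BioData_Dec25_esm2_prott5_prostt5_ankh3_large_esm3c_Layers_3Frist_3Last.backup
```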

---

### 3. Run the self-check with rebase from Zenodo

From the repository root:

```bash
bash pis_launcher_script.sh --rebase-from-zenodo
```

This script will:

1. Check that Docker is running.
2. Ensure the `pgvectorsql` container (PostgreSQL + pgvector) and `rabbitmq` container exist and are running.
3. Download the BioData backup from Zenodo (or reuse the existing file in the configured backup folder).
4. **Drop and recreate** the `BioData` database on `localhost:5432`.
5. Enable the `vector` extension.
6. Run `pg_restore` from the downloaded backup.

If the size check fails (the file looks too small), the script stops and tells you to correct `ZENODO_URL` or to download the backup manually into the configured backup folder.

> With `--rebase-from-zenodo`, the script focuses on the DB rebase and then exits, so you get a clean BioData database ready to use.

---

### Script Options

Common flags for `pis_launcher_script.sh`:

- `--rebase-from-zenodo`: Download (or reuse) the Zenodo backup and restore it.
- `--rebase-from-backup`: Restore from a local backup file.
- `--zenodo-url=...`: Override the Zenodo URL used for download.
- `--backup-folder=...`: Folder where backups are stored/loaded.
- `--backup-file-name=...`: Backup filename to use inside the backup folder.
- `--database-name=...`: Target database name (default: `BioData`).
- `--check-services` or `--check-services-only`: Only check Docker and container status without a restore.
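
When the launcher is called from a Python pipeline, the command line can be assembled programmatically. The sketch below is hypothetical glue code, not part of PIS: it only uses flags listed above, and it assumes that `--rebase-from-zenodo` can be combined with the `--zenodo-url`, `--backup-folder`, and `--database-name` overrides.

```python
def launcher_argv(script="pis_launcher_script.sh", *, rebase_from_zenodo=False,
                  zenodo_url=None, backup_folder=None, database_name=None):
    """Assemble the argv for the launcher script from the flags documented above."""
    argv = ["bash", script]
    if rebase_from_zenodo:
        argv.append("--rebase-from-zenodo")
    options = {
        "--zenodo-url": zenodo_url,
        "--backup-folder": backup_folder,
        "--database-name": database_name,
    }
    # Only emit the overrides that were actually provided.
    argv += [f"{flag}={value}" for flag, value in options.items() if value is not None]
    return argv


print(launcher_argv(rebase_from_zenodo=True, database_name="BioData"))
# prints: ['bash', 'pis_launcher_script.sh', '--rebase-from-zenodo', '--database-name=BioData']
```

The resulting list can be passed to `subprocess.run(argv, check=True)` from the repository root.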

---

### 4. Use the database

After the script completes successfully:

- Connect with `psql` as in Option A:

```bash
PGPASSWORD="clave" psql -h localhost -U usuario -d BioData
```

- Or point your applications to:

```
postgresql://usuario:clave@localhost:5432/BioData
```

PIS itself can then use this database for its embedding and lookup workflows.

---


## **Get started:**

To execute the full extraction process, install the dependencies and run the following command from the project root:

```bash
pis
```

This command will trigger the complete workflow, starting from the initial data preprocessing stages and continuing through to the final data organization and storage.

## **Customizing the Workflow:**

You can customize the sequence of tasks executed by modifying `main.py` or adjusting the relevant parameters in the `config.yaml` file. This allows you to tailor the extraction process to meet specific research needs or to experiment with different data processing configurations.


