Metadata-Version: 2.4
Name: vitax-rag
Version: 0.2.1
Summary: ViTax-RAG: Alignment-augmented language model for viral taxonomy classification
Author-email: Peak Zhou <your_email@xxx.edu>
License: MIT
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Requires-Dist: torch>=1.12
Requires-Dist: numpy>=1.21
Requires-Dist: scikit-learn<2.0,>=1.0
Requires-Dist: pandas<3.0,>=1.3
Requires-Dist: biopython>=1.80
Requires-Dist: tqdm>=4.60
Requires-Dist: torchvision
Requires-Dist: transformers<5.0,>=4.26
Requires-Dist: tokenizers<0.16,>=0.13
Requires-Dist: einops>=0.6
Requires-Dist: omegaconf>=2.2
Requires-Dist: hydra-core>=1.2

# ViTax-RAG: Retrieval-Augmented Language Model for Viral Taxonomy Classification

## Introduction
ViTax-RAG is a hybrid viral taxonomy classification framework that integrates BLAST-based retrieval with a Bi-Hyena genomic language model to achieve adaptive, accurate, and robust taxonomic assignment from metagenomic sequences.
## Installation 
ViTax-RAG can be installed **either via pip (recommended)** or **from source**.

### External Dependency: NCBI BLAST+ (Optional but Recommended)

ViTax-RAG optionally uses NCBI BLAST+ for retrieval-augmented inference.

Install via conda (recommended)
```bash
conda install -c bioconda blast
```

Verify installation:
```bash
which blastn
blastn -version
```

### Option 1️⃣ (Recommended): Install via `pip`

```bash
pip install vitax-rag
```

This installs the ViTax-RAG command-line tool:

```bash
vitax-rag --help
```
### Required External Data
ViTax-RAG requires external BLAST database files, which are not packaged
inside the Python wheel.
```bash
git clone https://github.com/Ying-Lab/ViTax-Rag.git
cd ViTax-Rag
```

### Quick Start
Basic run (automatic device selection)
```bash
vitax-rag \
  --contigs test_contigs.fasta \
  --data_dir ./data \
  --out result.txt
```

### Option 2️⃣ : Install from Source (Development Mode)
Install dependencies:
```powershell
git clone https://github.com/Ying-Lab/ViTax-Rag.git
cd ViTax-Rag
pip install -r requirements.txt
```
Install NCBI BLAST+:
```powershell
# 1) Download the latest BLAST+ tarball (adjust version if needed)
mkdir -p ~/downloads && cd ~/downloads
wget https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.12.0+-x64-linux.tar.gz

# 2) Extract and move to a permanent location
tar -zxvf ncbi-blast-2.12.0+-x64-linux.tar.gz
mv ncbi-blast-2.12.0+ ~/blast+

# 3) Persist PATH (so BLAST commands are available in every shell)
echo 'export PATH="$HOME/blast+/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc

# 4) Verify installation
which blastn
blastn -version
```

## Quick Start

Basic run (auto device selection):
```powershell
python predict.py --contigs test_contigs.fasta --out output.txt --device auto
```
Adjust chunking and batch size for long contigs:
```powershell
python predict.py --contigs test_contigs.fasta --chunk_size 2000 --window_size 400 
```


## Command-Line Options

- `--contigs` FASTA input path, default `test_contigs.fasta`
- `--out` output predictions file, default `output.txt`
- `--confidence` confidence threshold, default `0.6`
- `--window_size` sliding step size, default `400`
- `--chunk_size` chunk length, default `2000`
- `--batch_size` batch size, default `64`
- `--rc` Use bidirectional prediction, default `true`
- `--augment` use BLAST augmentation, default `true`
- `--augment_len` target augmented length, default `4000`
- `--device` `auto`/`cpu`/`cuda`, default `auto`


## Output

- One line per input sequence; fields are space-separated: `sequence_id label confidence`
- `sequence_id`: taken from the FASTA header (text after `>` up to the first space)
- `label`: either `unclassified` or `TaxonName_TaxonLevel`
- `TaxonLevel`: one of `Genus`, `Family`, `Order`, `Class`; if genus confidence is below `--confidence`, the classifier backs off to higher levels
- `confidence`: normalized score in `[0, 1]`, formatted to two decimals; higher means stronger support for the predicted taxon

- Ordering: lines follow the order of sequences in the input FASTA
- Device and augmentation: enabling `--rc` (reverse complement) and `--augment` (BLAST) can change predictions and confidence
- Output file: written to the path specified by `--out` and printed as `Done, please check the output file: <path>`



