Metadata-Version: 2.4
Name: celljanus
Version: 0.1.4
Summary: CellJanus: Dual-Perspective Deconvolution of Host and Microbial Transcriptomes from FASTQ Data
Author: Zhaoqing Wang
License: MIT
Project-URL: Homepage, https://github.com/zhaoqing-wang/CellJanus
Project-URL: Repository, https://github.com/zhaoqing-wang/CellJanus
Project-URL: Changelog, https://github.com/zhaoqing-wang/CellJanus/blob/main/CHANGELOG.md
Project-URL: Bug Tracker, https://github.com/zhaoqing-wang/CellJanus/issues
Keywords: bioinformatics,metagenomics,single-cell,spatial-transcriptomics,microbiome,host-microbe,deconvolution,docker
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: click>=8.1
Requires-Dist: rich>=13.0
Requires-Dist: pandas>=2.0
Requires-Dist: numpy>=1.24
Requires-Dist: matplotlib>=3.7
Requires-Dist: seaborn>=0.12
Requires-Dist: biopython>=1.81
Requires-Dist: requests>=2.31
Requires-Dist: tqdm>=4.65
Requires-Dist: psutil>=5.9
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"
Requires-Dist: black>=23.0; extra == "dev"
Requires-Dist: ruff>=0.1; extra == "dev"
Dynamic: license-file

<table>
  <tr>
    <td>
      <h1>CellJanus: Dual-Perspective Deconvolution of Host and Microbial Transcriptomes from FASTQ Data</h1>
      <p>
        <a href="https://pypi.org/project/celljanus/"><img src="https://img.shields.io/pypi/v/celljanus?color=blue&style=flat-square" alt="PyPI Version" /></a>
        <a href="https://pypi.org/project/celljanus/"><img src="https://img.shields.io/pypi/dm/celljanus?style=flat-square&label=Downloads" alt="PyPI Downloads" /></a>
        <a href="LICENSE"><img src="https://img.shields.io/badge/License-MIT-green.svg" alt="License: MIT" /></a>
        <a href="https://www.python.org/"><img src="https://img.shields.io/badge/Python-3.9%2B-blue.svg" alt="Python 3.9+" /></a>
        <a href="https://github.com/zhaoqing-wang"><img src="https://img.shields.io/badge/Maintainer-Zhaoqing_Wang-green" alt="GitHub Maintainer" /></a>
      </p>
    </td>
    <td width="200">
      <img src="docs/Sticker.png" alt="CellJanus Logo" width="200" />
    </td>
  </tr>
</table>

## Pipeline

```
FASTQ ─→ fastp (QC) ─→ Bowtie2 (host alignment) ─→ unmapped reads
                             │                            │
                             ▼                            ▼
                    host_aligned.bam           Kraken2 + Bracken  ─→  plots (PNG/PDF) + CSV tables
                   (gene expression)        (microbial abundance)
```

## Contents

1. [Installation](#installation)
2. [Quick Start](#quick-start)
3. [Real Data](#real-data)
4. [CLI Reference](#cli-reference)
5. [Python API](#python-api)
6. [Output Structure](#output-structure)
7. [Performance](#performance)
8. [Citation](#citation)

---

## Installation

### Option 1: pip install from PyPI (recommended)

> CellJanus is a pure-Python orchestrator — `pip install` gets you the CLI and API immediately.  
> External bioinformatics tools (fastp, Bowtie2, etc.) are only needed at **runtime**, not at install time.

```bash
pip install celljanus

# Verify
celljanus --version
```

### Option 2: Conda environment with full pipeline tools

> Requires **Linux / macOS / WSL2**. Bioconda packages are not available on native Windows.

```bash
# Create conda environment with all external tools
conda create -n CellJanus -c bioconda -c conda-forge \
    python=3.11 fastp bowtie2 samtools kraken2 bracken

conda activate CellJanus
pip install celljanus          # install from PyPI

# Verify all tools
celljanus check
```

All tools should show **✔ Found**. STAR is optional (for future RNA-seq alignment support).
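
For scripted environments, a rough stand-in for the same check can be built from the standard library. This is a minimal sketch, not how `celljanus check` is implemented: it assumes the executables carry the same names as the conda packages above, and `find_tools` is a hypothetical helper.

```python
import shutil

# Executable names are assumed to match the conda package names listed above.
REQUIRED_TOOLS = ("fastp", "bowtie2", "samtools", "kraken2", "bracken")


def find_tools(names=REQUIRED_TOOLS):
    """Map each tool name to its resolved path on PATH, or None if missing."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools().items():
        status = f"found at {path}" if path else "MISSING"
        print(f"{name:<10} {status}")
```

A `None` value means the tool is not on `PATH` for the current shell, which usually indicates that the conda environment has not been activated.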

### Option 3: Install from source (development)

```bash
git clone https://github.com/zhaoqing-wang/CellJanus.git
cd CellJanus
pip install -e ".[dev]"        # editable install with dev dependencies
```

<details>
<summary>Option 4: Docker</summary>

```bash
docker build -t celljanus .
docker run --rm celljanus celljanus check
```
</details>

---

## Quick Start

The repository includes test data and pre-built reference databases — run the full pipeline immediately with **no downloads required**.

```bash
conda activate CellJanus
cd CellJanus

celljanus run \
    --read1 testdata/reads_R1.fastq.gz \
    --read2 testdata/reads_R2.fastq.gz \
    --host-index testdata/refs/host_genome/host \
    --kraken2-db testdata/refs/kraken2_testdb \
    --output-dir test_results \
    --threads 4
```

**Test data**: 1,000 read pairs (600 human, 300 microbial, 100 low-quality).

**Results** (~4 seconds):

| Step | Metric |
|------|--------|
| QC | 1,000 → 900 pairs retained (90%), Q20 improved 88% → 98% |
| Host alignment | 66.39% aligned to host genome |
| Classification | 300 reads classified → 3 species detected |
| Top species | *S. aureus* 38.7%, *K. pneumoniae* 31.3%, *E. coli* 30.0% |
| Output | 8 plots (PNG + PDF), 3 CSV tables, QC reports |

#### Example Output

| Pipeline Dashboard |
|:--:|
| ![Pipeline Dashboard](docs/pipeline_dashboard.png) |
| *Summarises QC, alignment and classification metrics in a single view.* |

| Abundance Bar Chart | Abundance Donut Chart | Abundance Heatmap |
|:--:|:--:|:--:|
| ![Bar](docs/abundance_bar.png) | ![Pie](docs/abundance_pie.png) | ![Heatmap](docs/abundance_heatmap.png) |
| Top species ranked by read count. | Relative proportion of each species. | Log₁₀-scaled heatmap of species abundance. |

### Run Individual Steps

```bash
# QC only
celljanus qc -1 testdata/reads_R1.fastq.gz -2 testdata/reads_R2.fastq.gz -o results/01_qc

# Align to host
celljanus align -1 results/01_qc/reads_R1_qc.fastq.gz \
    -2 results/01_qc/reads_R2_qc.fastq.gz \
    -x testdata/refs/host_genome/host -o results/02_alignment

# Classify microbial reads
celljanus classify -1 results/02_alignment/unmapped_R1.fastq.gz \
    -2 results/02_alignment/unmapped_R2.fastq.gz \
    -d testdata/refs/kraken2_testdb -o results/04_classification

# Generate plots
celljanus visualize -b results/04_classification/bracken_S.txt -o results/05_visualisation
```
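
The four commands above can also be chained from a script. The sketch below only assembles the argument lists shown in this section and hands them to `subprocess.run`. `build_steps` and `run_steps` are hypothetical helpers, and the intermediate file names (including the `_qc` suffix on trimmed reads) are copied from the commands above rather than discovered at run time.

```python
import subprocess
from pathlib import Path


def build_steps(r1, r2, host_index, kraken2_db, out="results"):
    """Return the four CLI invocations from this section as argument lists."""
    r1, r2, out = Path(r1), Path(r2), Path(out)
    qc, aln = out / "01_qc", out / "02_alignment"
    cls, vis = out / "04_classification", out / "05_visualisation"

    def qc_name(p):
        # Assumed naming: reads_R1.fastq.gz -> reads_R1_qc.fastq.gz
        return p.name.replace(".fastq.gz", "_qc.fastq.gz")

    return [
        ["celljanus", "qc", "-1", str(r1), "-2", str(r2), "-o", str(qc)],
        ["celljanus", "align",
         "-1", str(qc / qc_name(r1)), "-2", str(qc / qc_name(r2)),
         "-x", str(host_index), "-o", str(aln)],
        ["celljanus", "classify",
         "-1", str(aln / "unmapped_R1.fastq.gz"),
         "-2", str(aln / "unmapped_R2.fastq.gz"),
         "-d", str(kraken2_db), "-o", str(cls)],
        ["celljanus", "visualize",
         "-b", str(cls / "bracken_S.txt"), "-o", str(vis)],
    ]


def run_steps(steps):
    """Run the commands in order; check=True raises on the first failure."""
    for cmd in steps:
        subprocess.run(cmd, check=True)
```

Keeping command construction separate from execution makes it possible to print or log the exact commands before anything runs.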

---

## Real Data

### 1. Download reference databases

```bash
# Human genome hg38 + Bowtie2 index (~5 GB)
celljanus download hg38 -o ./refs

# Kraken2 standard database (~8 GB)
celljanus download kraken2 -o ./refs --db-name standard_8
```

### 2. Run pipeline

```bash
celljanus run \
    -1 /path/to/sample_R1.fastq.gz \
    -2 /path/to/sample_R2.fastq.gz \
    -x ./refs/bowtie2_index/GRCh38_noalt_as \
    -d ./refs/standard_8 \
    -o ./results \
    --threads 8
```

### Key Options

| Option | Default | Description |
|--------|---------|-------------|
| `-1, --read1` | *required* | R1 FASTQ (or single-end FASTQ) |
| `-2, --read2` | — | R2 FASTQ for paired-end |
| `-x, --host-index` | *required* | Bowtie2 index prefix |
| `-d, --kraken2-db` | *required* | Kraken2 database path |
| `-o, --output-dir` | `celljanus_output` | Output directory |
| `-t, --threads` | auto (CPUs − 2) | Worker threads |
| `--min-quality` | 15 | Phred quality threshold |
| `--confidence` | 0.05 | Kraken2 confidence |
| `--bracken-level` | S | Taxonomic level (D/P/C/O/F/G/S) |
| `--skip-qc` | — | Skip QC step |
| `--skip-classify` | — | Skip classification |
| `--skip-visualize` | — | Skip visualisation |
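
The `auto (CPUs − 2)` default for `--threads` can be read as the rule below. This is an illustrative sketch, not the package's own code: `default_threads` is a hypothetical name, and the floor of one thread on small machines is an assumption.

```python
import os


def default_threads(cpu_count=None, reserve=2):
    """Leave `reserve` CPUs free for the rest of the system, never below 1."""
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, cpu_count - reserve)
```

On an 8-core machine this gives 6 worker threads; passing `--threads` explicitly overrides the automatic value.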

---

## CLI Reference

| Command | Description |
|---------|-------------|
| `celljanus run` | Full pipeline: QC → Align → Classify → Visualize |
| `celljanus qc` | Quality control (fastp) |
| `celljanus align` | Host alignment + unmapped extraction (Bowtie2) |
| `celljanus extract` | Extract unmapped reads from BAM |
| `celljanus classify` | Taxonomic classification (Kraken2 + Bracken) |
| `celljanus visualize` | Generate abundance plots |
| `celljanus download` | Download reference databases |
| `celljanus check` | Verify external tool installation |

Run `celljanus <command> --help` for full option details.

---

## Python API

```python
from pathlib import Path
from celljanus.config import CellJanusConfig
from celljanus.pipeline import run_pipeline

cfg = CellJanusConfig(
    output_dir=Path("./results"),
    host_index=Path("./refs/bowtie2_index/GRCh38_noalt_as"),
    kraken2_db=Path("./refs/standard_8"),
    threads=8,
)

result = run_pipeline(
    Path("sample_R1.fastq.gz"),
    read2=Path("sample_R2.fastq.gz"),
    cfg=cfg,
)

result.bracken_df          # Species abundance (pandas DataFrame)
result.qc_report.summary() # QC statistics
```
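
For several samples, the call above can sit inside a loop. The sketch below covers only the file-pairing half, using the standard library. It assumes the `_R1`/`_R2` naming used throughout this README, and `pair_fastqs` is a hypothetical helper rather than part of the CellJanus API.

```python
from pathlib import Path


def pair_fastqs(directory, suffix=".fastq.gz"):
    """Yield (sample, r1, r2) for every *_R1 file in `directory`.

    r2 is None when no matching *_R2 file exists.
    """
    directory = Path(directory)
    for r1 in sorted(directory.glob(f"*_R1{suffix}")):
        # Strip the trailing "_R1.fastq.gz" to recover the sample name.
        sample = r1.name[: -len(f"_R1{suffix}")]
        r2 = r1.with_name(f"{sample}_R2{suffix}")
        yield sample, r1, (r2 if r2.exists() else None)
```

Each `(r1, r2)` pair can then be passed to `run_pipeline(r1, read2=r2, cfg=cfg)` as shown above, ideally with a separate `output_dir` per sample so results do not overwrite each other.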

---

## Output Structure

```
output_dir/
├── 01_qc/                           # Quality control
│   ├── *_qc.fastq.gz               # Trimmed reads
│   ├── *_fastp.json                 # QC metrics
│   └── *_fastp.html                 # Interactive report
├── 02_alignment/                    # Host alignment
│   ├── host_aligned.sorted.bam      # Full alignment
│   ├── host_mapped.sorted.bam       # Host-only reads
│   ├── unmapped_R{1,2}.fastq.gz     # Non-host reads → classification
│   └── host_align_stats.txt         # Alignment statistics
├── 04_classification/               # Microbial classification
│   ├── kraken2_report.txt           # Taxonomic report
│   ├── kraken2_output.txt           # Per-read assignments
│   └── bracken_S.txt                # Species abundance
├── 05_visualisation/plots/          # Figures (PNG + PDF)
│   ├── abundance_bar.*              # Horizontal bar chart
│   ├── abundance_pie.*              # Donut chart
│   ├── abundance_heatmap.*          # Heatmap (log₁₀ scale)
│   └── pipeline_dashboard.*         # Summary dashboard
├── 06_tables/                       # Machine-readable results
│   ├── pipeline_summary.csv         # Per-step metrics
│   ├── species_abundance.csv        # Species × reads × fraction
│   └── output_manifest.csv          # File inventory with sizes
└── celljanus.log                    # Pipeline log
```
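
The `*_fastp.json` file in `01_qc/` is written by fastp itself, so its QC numbers can be read without going through CellJanus. A minimal sketch, assuming the `summary.before_filtering` / `summary.after_filtering` layout found in fastp JSON reports; `qc_retention` is a hypothetical helper.

```python
import json
from pathlib import Path


def qc_retention(fastp_json):
    """Return read retention and Q20 rates from a fastp JSON report."""
    summary = json.loads(Path(fastp_json).read_text())["summary"]
    before, after = summary["before_filtering"], summary["after_filtering"]
    return {
        "reads_before": before["total_reads"],
        "reads_after": after["total_reads"],
        "retained_pct": 100.0 * after["total_reads"] / before["total_reads"],
        "q20_before": before["q20_rate"],  # fraction between 0 and 1
        "q20_after": after["q20_rate"],
    }
```

Note that fastp counts individual reads, so a paired-end run reports twice as many reads as pairs.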

### CSV Tables

**`species_abundance.csv`**:

| name | taxonomy_id | bracken_estimated | fraction_pct |
|------|-------------|------------------:|--------------:|
| Staphylococcus aureus | 1280 | 116 | 38.67 |
| Klebsiella pneumoniae | 573 | 94 | 31.33 |
| Escherichia coli | 562 | 90 | 30.00 |

**`pipeline_summary.csv`**: one row per metric (Step, Metric, Value) covering QC, alignment, and classification statistics.
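
Because the tables are plain CSV, they can be checked with the standard library alone. The sketch below recomputes `fraction_pct` from `bracken_estimated` using the three example rows above, embedded as a string so it runs without any pipeline output. `recompute_fractions` is a hypothetical helper, and a comma-separated file with the header shown is assumed.

```python
import csv
import io

# The example rows from the species_abundance.csv table above.
EXAMPLE = """\
name,taxonomy_id,bracken_estimated,fraction_pct
Staphylococcus aureus,1280,116,38.67
Klebsiella pneumoniae,573,94,31.33
Escherichia coli,562,90,30.00
"""


def recompute_fractions(handle):
    """Return {species: percent of total estimated reads}, rounded to 2 dp."""
    rows = list(csv.DictReader(handle))
    total = sum(int(r["bracken_estimated"]) for r in rows)
    return {
        r["name"]: round(100 * int(r["bracken_estimated"]) / total, 2)
        for r in rows
    }


fractions = recompute_fractions(io.StringIO(EXAMPLE))
```

For a real run, open the file instead of the embedded string, for example `open("results/06_tables/species_abundance.csv", newline="")`.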

---

## Performance

| Component | Memory | Note |
|-----------|--------|------|
| fastp | < 1 GB | Streaming I/O |
| Bowtie2 + hg38 | ~3.5 GB | Memory-mapped index |
| Kraken2 (standard DB) | ~8 GB | `--memory-mapping` flag |
| **Peak total** | **~12–14 GB** | Fits a 32 GB laptop |
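
The peak figure is consistent with adding up the component footprints plus some headroom. A small sketch of that arithmetic, using the approximate values from the table; treating `< 1 GB` as 1 GB and the 1.5 GB headroom are assumptions, and `fits_in` is a hypothetical helper.

```python
# Approximate footprints in GB, taken from the table above.
COMPONENTS_GB = {"fastp": 1.0, "bowtie2_hg38": 3.5, "kraken2_standard": 8.0}


def fits_in(ram_gb, components=COMPONENTS_GB, headroom_gb=1.5):
    """True if the summed footprints plus headroom fit in `ram_gb`."""
    return sum(components.values()) + headroom_gb <= ram_gb
```

The machine's installed memory can be supplied from `psutil.virtual_memory().total / 1024**3`, since `psutil` is already a CellJanus dependency.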

---

## Citation

```
Wang Z (2026). CellJanus: A Dual-Perspective Tool for Deconvolving Host
Single-Cell and Microbial Transcriptomes. Python package version 0.1.4.
https://github.com/zhaoqing-wang/CellJanus
```

## License

[MIT](LICENSE)
