Metadata-Version: 2.4
Name: useq2tracks
Version: 1.0.0
Summary: Universal Sequencing-to-Genome-Browser-Tracks Pipeline
Author: Patrick Grady
License-Expression: MIT
Project-URL: Homepage, https://github.com/pgrady1322/uSeq2Tracks
Project-URL: Repository, https://github.com/pgrady1322/uSeq2Tracks
Project-URL: Issues, https://github.com/pgrady1322/uSeq2Tracks/issues
Keywords: bioinformatics,genomics,ucsc,snakemake,pipeline
Classifier: Development Status :: 5 - Production/Stable
Classifier: Environment :: Console
Classifier: Intended Audience :: Science/Research
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Requires-Dist: ruff; extra == "dev"
Requires-Dist: mypy; extra == "dev"
Provides-Extra: yaml
Requires-Dist: pyyaml>=6.0; extra == "yaml"
Dynamic: license-file

# uSeq2Tracks: Universal Sequencing to Browser Tracks Pipeline

[![Version](https://img.shields.io/badge/version-1.0.0-blue.svg)](https://github.com/pgrady1322/uSeq2Tracks/releases)
[![CI](https://github.com/pgrady1322/uSeq2Tracks/actions/workflows/ci.yml/badge.svg)](https://github.com/pgrady1322/uSeq2Tracks/actions/workflows/ci.yml)
[![PyPI](https://img.shields.io/pypi/v/useq2tracks)](https://pypi.org/project/useq2tracks/)
[![Snakemake](https://img.shields.io/badge/snakemake-≥6.0.0-brightgreen.svg)](https://snakemake.github.io)
[![Nextflow DSL2](https://img.shields.io/badge/nextflow%20DSL2-%E2%89%A521.10.3-23aa62.svg?labelColor=000000)](https://www.nextflow.io/)
[![Python](https://img.shields.io/badge/python-≥3.10-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE)

A comprehensive pipeline for processing diverse sequencing datasets and generating standardized genomic tracks for UCSC Genome Browser visualization. uSeq2Tracks handles everything from raw sequencing data to publication-ready browser tracks with minimal user intervention.

**Available in two implementations:**
- **Snakemake** — stable, feature-complete
- **Nextflow DSL2** — cloud/HPC-optimized with Docker & Singularity support

## 🌟 Overview

uSeq2Tracks is designed to standardize the processing of heterogeneous sequencing datasets, transforming raw sequencing data into organized, genome browser-ready tracks. Whether you're working with public datasets from ENCODE or your own experimental data, uSeq2Tracks provides a unified workflow that handles quality control, alignment, track generation, and browser hub creation.

### Key Features

- **Universal Input Support**: Handles both local FASTQ files and SRA accessions
- **Multiple Assay Types**: ChIP-seq, ATAC-seq, CUT&RUN, RNA-seq, WGS, Long-reads, etc.
- **Two Processing Modes**: Standard (full QC) and Rapid (streamlined for public data)
- **Genome-ID Organization**: All outputs tagged with unique genome identifiers
- **UCSC Integration**: Automatic track hub generation for browser visualization
- **Quality Control**: Comprehensive QC reports with FastQC, MultiQC, and assay-specific metrics
- **Flexible Configuration**: Extensive parameter customization for each assay type

## 🧬 Supported Assay Types

| Assay Type | Purpose | Key Outputs | Peak Calling |
|------------|---------|-------------|--------------|
| **ChIP-seq** | Histone modifications, TF binding | BigWig tracks, narrowPeak files | MACS3 |
| **ATAC-seq** | Chromatin accessibility | BigWig tracks, narrowPeak files | MACS3 |
| **CUT&RUN** | Low-input chromatin profiling | BigWig tracks, narrowPeak files | MACS3 |
| **RNA-seq** | Gene expression | BigWig tracks, count matrices | N/A |
| **WGS** | Genome-wide sequencing | BigWig coverage, variant calls | N/A |
| **Ancient DNA** | Historical/archaeological samples | BigWig tracks, damage analysis | N/A |
| **Nanopore** | Long-read sequencing | BigWig tracks, structural variants | N/A |
| **PacBio** | Long-read sequencing | BigWig tracks, high-accuracy variants | N/A |

## 🚀 Quick Start

### 1. Installation

```bash
# Clone the repository
git clone https://github.com/pgrady1322/uSeq2Tracks.git
cd uSeq2Tracks

# Create the conda environment (bioinformatics tools + dependencies)
conda env create -f envs/useq2tracks.yml
conda activate useq2tracks

# Install the useq2tracks CLI
pip install -e .

# Or install the CLI from PyPI (without pipeline files)
# pip install useq2tracks
```

### 2. Configuration

Edit the main configuration file:

```yaml
# config.yaml
samplesheet: "samples.csv"           # Your sample metadata
genome: "/path/to/genome.fa"         # Reference genome
genome_id: "galGal6"                 # REQUIRED: Unique genome identifier
gtf: "/path/to/annotations.gtf"      # Gene annotations (optional)
outdir: "./results"                  # Output directory
rapid_mode: false                    # Set to true for streamlined processing
```

### 3. Sample Sheet Setup

Create a sample sheet describing your data:

**Standard Mode Example** (comprehensive QC):
```csv
sample_id,type,sra_id,read1,read2,experiment_group,replicate_group,condition
sample1,chipseq,SRR123456,,,H3K27ac,replicate1,treatment
sample2,atacseq,,data/sample2_R1.fastq.gz,data/sample2_R2.fastq.gz,accessibility,replicate1,control
sample3,rnaseq,,data/sample3_R1.fastq.gz,data/sample3_R2.fastq.gz,expression,timepoint1,control
```

**Rapid Mode Example** (public datasets):
```csv
sample_id,type,sra_id,read1,read2,experiment_group,replicate_group,condition
ENCODE_H3K27ac_rep1,chipseq,SRR1536404,,,H3K27ac,ENCSR000EWQ,K562
ENCODE_H3K27ac_rep2,chipseq,SRR1536405,,,H3K27ac,ENCSR000EWQ,K562
ENCODE_Input,chipseq,SRR1536406,,,H3K27ac,input,input
```

### 4. Run the Pipeline

```bash
# ── Using the useq2tracks CLI (recommended) ──

# Snakemake (default engine)
useq2tracks run                              # all cores, config.yaml
useq2tracks run --cores 16                   # limit to 16 cores
useq2tracks run --dryrun                     # preview what will run
useq2tracks run --rapid                      # skip QC, essential tracks only
useq2tracks run --profile slurm              # submit to SLURM
useq2tracks run --resume                     # resume interrupted run

# Nextflow
useq2tracks run --engine nextflow
useq2tracks run --engine nextflow --profile docker
useq2tracks run --engine nextflow --resume

# Forward extra engine-specific flags after '--'
useq2tracks run -- --printshellcmds          # Snakemake shell echo
useq2tracks run --engine nextflow -- -with-tower   # Nextflow Tower

# ── Other CLI subcommands ──

# Validate inputs before running
useq2tracks validate samples.csv --configfile config.yaml

# Show pipeline info
useq2tracks info

# Generate a standalone UCSC track hub
useq2tracks hub --genome-id galGal6 --hub-name MyHub \
  --hub-short-label "My Hub" --hub-long-label "My Sequencing Hub" \
  --genome-name galGal6 --hub-email user@example.com \
  --bigwigs *.bw --output-dir ucsc_hub/

# ── Or invoke engines directly (advanced) ──
snakemake --use-conda --cores all
nextflow run main.nf -profile docker --genome_id galGal6
```

---

## 🔄 Nextflow Implementation

uSeq2Tracks includes a **Nextflow DSL2** pipeline for improved scalability, cloud integration, and HPC support. The Nextflow implementation provides all core functionality with enhanced portability and parallelization.

### Why Use Nextflow?

**Advantages:**
- ✅ **Better Parallelization**: Automatic task-level optimization
- ✅ **Improved Resume**: More robust caching and resume capability
- ✅ **Cloud Native**: Built-in support for AWS, Google Cloud, Azure
- ✅ **HPC Ready**: First-class SLURM, SGE, PBS support
- ✅ **Container First**: Excellent Docker/Singularity integration
- ✅ **Portable**: Works identically across different systems

### Nextflow Quick Start

#### 1. Install Nextflow

```bash
curl -s https://get.nextflow.io | bash
sudo mv nextflow /usr/local/bin/
```

#### 2. Configure Pipeline

Create a `params.yaml` file:

```yaml
# Input/output
samplesheet: "samples.csv"
genome: "/path/to/genome.fa"
genome_id: "galGal6"  # REQUIRED
outdir: "./results"

# Pipeline mode
rapid_mode: false

# ATAC-seq settings
atacseq:
  mapper: "bowtie2"
  markdup: false
  macs3_opts: "--qval 0.05"
  bw_norm: "CPM"

# ChIP-seq settings
chipseq:
  mapper: "bowtie2"
  control_tag: "input"
  markdup: false

# UCSC hub
ucsc:
  hub_name: "myHub"
  hub_short_label: "My Data"
  hub_long_label: "My Sequencing Data Hub"
  genome_name: "galGal6"
  hub_email: "pgrady1322@gmail.com"
```

#### 3. Run Nextflow Pipeline

```bash
# Navigate to nextflow directory
cd nextflow/

# With Docker (local)
nextflow run main.nf -profile docker -params-file params.yaml

# With Singularity (HPC)
nextflow run main.nf -profile singularity -params-file params.yaml

# With SLURM + Singularity
nextflow run main.nf -profile slurm,singularity -params-file params.yaml

# Rapid mode
nextflow run main.nf -profile rapid,docker -params-file params.yaml

# Resume failed run
nextflow run main.nf -profile docker -resume
```

### Nextflow Features

**Implemented Workflows:**
- ✅ ATAC-seq workflow with Tn5 shift correction
- ✅ ChIP-seq workflow with control matching
- ✅ CUT&RUN workflow
- ✅ Genome preparation and indexing
- ✅ UCSC track hub generation
- ✅ Samplesheet validation
- ✅ Dynamic resource allocation
- ✅ Multiple execution profiles

**Snakemake-only (not yet ported to Nextflow):**
- RNA-seq workflow
- WGS workflow
- Long-read workflows (Nanopore, PacBio)
- Ancient DNA workflow
- SRA download integration
- FastQC/MultiQC integration
- Replicate merging

### Nextflow Profiles

**Container Profiles:**
- `docker`: Use Docker containers (recommended for local)
- `singularity`: Use Singularity containers (recommended for HPC)
- `conda`: Use Conda environments

**Executor Profiles:**
- `local`: Run on local machine (default)
- `slurm`: Submit to SLURM scheduler
- `sge`: Submit to SGE scheduler
- `pbs`: Submit to PBS scheduler

**Special Profiles:**
- `test`: Run with minimal test dataset
- `rapid`: Skip QC, generate essential tracks only

**Combine profiles with commas:**
```bash
nextflow run main.nf -profile slurm,singularity  # SLURM + Singularity
nextflow run main.nf -profile docker,rapid       # Docker + Rapid mode
```

### Nextflow Configuration Example

```yaml
# params.yaml for Nextflow
samplesheet: "samples.csv"
genome: "/path/to/genome.fa"
genome_id: "galGal6"

# Resource limits
max_cpus: 64
max_memory: "256.GB"
max_time: "240.h"

# ATAC-seq parameters
atacseq:
  mapper: "bowtie2"
  bowtie2_opts: "--very-sensitive"
  markdup: false
  shift: -75
  extsize: 150
  bw_norm: "CPM"

# ChIP-seq parameters  
chipseq:
  mapper: "bowtie2"
  control_tag: "input"
  macs3_opts: "--qval 0.05 --keep-dup all"
  bw_norm: "CPM"
```

### Nextflow Output Structure

```
results/
└── galGal6/                    # Genome ID
    ├── atacseq/
    │   ├── bam/
    │   ├── peaks/
    │   └── bigwig/
    ├── chipseq/
    │   ├── bam/
    │   ├── peaks/
    │   └── bigwig/
    ├── ucsc/
    │   ├── hub.txt
    │   ├── genomes.txt
    │   └── trackDb.txt
    └── pipeline_info/         # Execution reports
        ├── execution_report.html
        ├── execution_timeline.html
        └── execution_trace.txt
```

### Nextflow Quick Reference

**Basic Commands:**
```bash
# Run pipeline
nextflow run main.nf -profile docker -params-file params.yaml

# Resume failed run
nextflow run main.nf -profile docker -resume

# Override parameters
nextflow run main.nf --genome my_genome.fa --genome_id myGenome

# Generate reports
nextflow run main.nf -with-report -with-timeline -with-dag

# Limit concurrent jobs
nextflow run main.nf -profile slurm -qs 50
```

**Troubleshooting:**
```bash
# Check configuration
nextflow config main.nf

# Test run
nextflow run main.nf -profile test,docker

# Clean and restart
rm -rf work/ && nextflow run main.nf
```

### Choosing Between Snakemake and Nextflow

| Feature | Snakemake | Nextflow |
|---------|-----------|----------|
| **Maturity** | Stable, feature-complete | Stable (core assays) |
| **Learning Curve** | Python-based (easier for most) | Groovy-based |
| **Parallelization** | Good | Better (automatic) |
| **Cloud Support** | Via plugins | Native |
| **HPC Integration** | Good | Excellent |
| **Resume Capability** | Good | Excellent |
| **Container Support** | Good | Excellent |
| **Best For** | General bioinformatics | HPC/Cloud deployments |

**Recommendation:**
- Use **Snakemake** for: All assay types, Python familiarity, full feature set
- Use **Nextflow** for: HPC/cloud environments, containerized execution, core epigenomic assays

### Nextflow Documentation

For complete Nextflow documentation, see:
- `nextflow/QUICK_REFERENCE.md` — Quick command reference
- `nextflow/examples/` — Example configurations

---

## 📊 Pipeline Modes

### Standard Mode (Default)
**Purpose**: Comprehensive analysis with full quality control

**Features**:
- ✅ Complete FastQC reports for all samples
- ✅ Adapter trimming with detailed summaries
- ✅ MultiQC comprehensive aggregate report
- ✅ Comprehensive track generation
- ✅ Parameter sweep analysis (optional)
- ✅ Replicate-merged composite tracks
- ✅ Detailed QC metrics and visualizations
- ✅ Genrich additional outputs (when enabled)

**Pipeline Components**:
- FastQC quality reports for raw reads
- Adapter trimming outputs and summaries
- MultiQC aggregated QC dashboard
- Primary pipeline outputs (BigWig, BAM, peaks)
- Composite tracks for replicate groups
- Parameter sweep comparisons
- Complete UCSC hub files

**Use Cases**:
- ✨ Novel experimental datasets requiring comprehensive QC
- ✨ Unknown sample quality scenarios
- ✨ Publication-ready analysis with full documentation
- ✨ Parameter optimization studies
- ✨ Research datasets needing complete audit trail

**Output Structure**:
```
results/{genome_id}/
├── qc/
│   ├── fastqc/           # Individual FastQC reports
│   └── multiqc_report.html
├── trimmed/              # Adapter trimming outputs
├── {assay}/              # Assay-specific tracks
├── ucsc/
│   ├── hub.txt
│   └── composite_trackDb.txt
└── genrich_sweep/        # Parameter sweeps (if enabled)
```

### Rapid Mode
**Purpose**: Streamlined processing for high-confidence datasets

**Features**:
- ⚡ Essential track generation only
- ⚡ Skipped QC reports (FastQC, MultiQC)
- ⚡ No adapter trimming summaries
- ⚡ No composite tracks for replicate groups
- ⚡ No parameter sweep outputs
- ⚡ No Genrich additional outputs
- ⚡ Core outputs: BigWig tracks, peak calls, basic UCSC hubs
- ⚡ Faster processing time (30-50% faster)
- ⚡ Reduced storage footprint
- ⚡ Simplified output structure

**Pipeline Components**:
- SRA downloads (if needed)
- Genome indexing
- Primary alignment and track generation
- Essential peak calling
- Basic UCSC hub files (no composites)
- Rapid completion tracking

**Use Cases**:
- 🚀 Public datasets (ENCODE, TCGA, GEO) with known quality
- 🚀 Quick browser track generation for visualization
- 🚀 Streamlined processing when QC is unnecessary
- 🚀 Fast turnaround for track sharing
- 🚀 Time-sensitive analyses

**Output Structure**:
```
results/{genome_id}/
├── {assay}/              # Essential tracks only
├── ucsc/
│   └── hub.txt           # Basic hub (no composites)
└── rapid/
    └── rapid_tracks_complete.txt  # Completion summary
```

**Activation**:
```yaml
rapid_mode: true
```

**Configuration Example**:
```yaml
# For public ENCODE datasets
genome_id: "hg38_ENCODE"
samplesheet: "encode_samples.csv"
rapid_mode: true              # Enable rapid mode
genome: "/data/hg38.fa"

# Pipeline will skip:
# - FastQC reports
# - MultiQC aggregation
# - Adapter trimming summaries
# - Composite track generation
```

### Mode Comparison Table

| Feature | Standard Mode | Rapid Mode |
|---------|---------------|------------|
| **FastQC Reports** | ✅ Yes | ❌ Skipped |
| **MultiQC Dashboard** | ✅ Yes | ❌ Skipped |
| **Adapter Trimming** | ✅ Full summaries | ❌ No summaries |
| **BigWig Tracks** | ✅ Yes | ✅ Yes |
| **Peak Calling** | ✅ Yes | ✅ Yes |
| **UCSC Hub** | ✅ With composites | ✅ Basic only |
| **Parameter Sweeps** | ✅ Optional | ❌ Skipped |
| **Composite Tracks** | ✅ Yes | ❌ Skipped |
| **Processing Time** | Baseline | 30-50% faster |
| **Storage Usage** | Full | Reduced |
| **Best For** | Research data | Public datasets |

### Backward Compatibility

- **Default behavior unchanged**: `rapid_mode` defaults to `false`
- **Existing configs work**: No breaking changes to current setups
- **Progressive enhancement**: Rapid mode is opt-in feature
- **Output tagging**: All outputs tagged with `genome_id` regardless of mode

### ✨ Benefits of Dual-Mode Architecture

**Rapid Mode Benefits**:
1. ⚡ **30-50% faster processing** for public datasets
2. 💾 **Reduced storage footprint** without QC intermediates  
3. 🎯 **Cleaner output** focused on essential browser tracks
4. 🚀 **Quick turnaround** for time-sensitive visualization needs
5. 📊 **Streamlined workflows** for known high-quality data

**Standard Mode Benefits**:
1. 📈 **Complete audit trail** for research datasets
2. 🔬 **Comprehensive quality assessment** for novel samples
3. 📚 **Publication-ready** with full documentation
4. 🔍 **Deep quality insights** via MultiQC aggregation
5. 🎛️ **Parameter optimization** capabilities

**Flexibility Advantages**:
- Switch between modes based on data source
- Mix rapid and standard processing in same project
- Maintain quality standards while optimizing efficiency
- Preserve comprehensive analysis when needed

## 📁 Output Structure

The pipeline generates genome-tagged outputs with the following structure:

### Standard Mode Output (rapid_mode: false)

```
results/
└── {genome_id}/                    # e.g., galGal6/
    ├── genome/                     # Genome indices
    │   ├── genome.fa              # Reference genome
    │   ├── star/                  # STAR index
    │   ├── bowtie2/               # Bowtie2 index
    │   └── bwa_mem2/              # BWA-MEM2 index
    ├── qc/                        # Quality control
    │   ├── fastqc/                # FastQC reports (HTML + zip)
    │   │   ├── sample1_fastqc.html
    │   │   └── sample1_fastqc.zip
    │   └── multiqc_report.html    # Aggregated QC report
    ├── trimmed/                   # Adapter trimming (if enabled)
    │   ├── sample1_R1_trimmed.fq.gz
    │   ├── sample1_R2_trimmed.fq.gz
    │   └── trimming_reports/
    ├── {assay_type}/              # Per-assay outputs
    │   ├── bam/                   # Aligned reads
    │   │   ├── sample1.sorted.bam
    │   │   └── sample1.sorted.bam.bai
    │   ├── bigwig/                # Coverage tracks
    │   │   ├── sample1.bw
    │   │   └── merged_replicates.bw
    │   └── peaks/                 # Peak calls (when applicable)
    │       ├── sample1_peaks.narrowPeak
    │       └── merged_peaks.bed
    ├── ucsc/                      # UCSC track hubs
    │   ├── hub.txt                # Main hub file
    │   ├── genomes.txt            # Genome specification
    │   ├── trackDb.txt            # Track definitions
    │   └── composite_trackDb.txt  # Composite track definitions
    ├── genrich_sweep/             # Parameter sweeps (if enabled)
    │   ├── qval_0.05/
    │   ├── qval_0.01/
    │   └── qval_0.001/
    └── logs/                      # Pipeline logs
        ├── alignment/
        ├── peak_calling/
        └── track_generation/
```

### Rapid Mode Output (rapid_mode: true)

```
results/
└── {genome_id}/                    # e.g., galGal6/
    ├── genome/                     # Genome indices (same as standard)
    │   ├── genome.fa
    │   ├── star/
    │   ├── bowtie2/
    │   └── bwa_mem2/
    ├── {assay_type}/              # Essential tracks only
    │   ├── bam/                   # Aligned reads
    │   │   └── sample1.sorted.bam
    │   ├── bigwig/                # Coverage tracks
    │   │   └── sample1.bw
    │   └── peaks/                 # Peak calls (when applicable)
    │       └── sample1_peaks.narrowPeak
    ├── ucsc/                      # Basic UCSC hub
    │   ├── hub.txt                # Main hub file
    │   ├── genomes.txt            # Genome specification
    │   └── trackDb.txt            # Track definitions (no composites)
    ├── rapid/                     # Rapid mode tracking
    │   └── rapid_tracks_complete.txt  # Completion summary with stats
    └── logs/                      # Pipeline logs
        ├── alignment/
        ├── peak_calling/
        └── track_generation/
```

### Output Differences Summary

| Output Component | Standard Mode | Rapid Mode |
|------------------|---------------|------------|
| **QC Reports** | Full FastQC + MultiQC | Skipped |
| **Trimming Outputs** | Detailed summaries | No summaries |
| **Composite Tracks** | Generated | Skipped |
| **Parameter Sweeps** | Optional | Skipped |
| **Hub Complexity** | With composites | Basic only |
| **Storage Footprint** | Full | ~30-40% smaller |
| **Completion Marker** | Standard | `rapid_tracks_complete.txt` |

### Key Output Files

**BigWig Tracks** (`.bw`):
- Genome-wide coverage tracks for UCSC browser
- Normalized by CPM, RPKM, or custom methods
- Strand-specific for RNA-seq (when applicable)

**Peak Files** (`.narrowPeak`, `.broadPeak`):
- BED-format files with enriched regions
- MACS3 output with q-values and fold-enrichment
- Optional Genrich peaks for ATAC-seq

**BAM Files** (`.bam`):
- Sorted and indexed aligned reads
- Optional duplicate marking
- Quality filtered (MAPQ thresholds applied)

**UCSC Hub Files**:
- `hub.txt`: Hub metadata and contact info
- `genomes.txt`: Genome assembly specifications
- `trackDb.txt`: Individual track configurations
- `composite_trackDb.txt`: Grouped track configurations (standard mode only)

## 🔧 Configuration Details

### Required Parameters

```yaml
# Essential settings - must be configured
samplesheet: "samples.csv"          # Sample metadata file
genome: "/path/to/genome.fa"        # Reference genome FASTA
genome_id: "your_genome_id"         # REQUIRED: Unique identifier
outdir: "./results"                 # Output directory
```

### Sample Sheet Format

| Column | Description | Example | Required |
|--------|-------------|---------|----------|
| `sample_id` | Unique sample identifier | `H3K27ac_rep1` | Yes |
| `type` | Assay type | `chipseq`, `atacseq`, `rnaseq` | Yes |
| `sra_id` | SRA accession (if downloading) | `SRR123456` | If no local files |
| `read1` | Path to R1 FASTQ | `data/sample_R1.fastq.gz` | If no SRA |
| `read2` | Path to R2 FASTQ | `data/sample_R2.fastq.gz` | For paired-end |
| `experiment_group` | Experimental grouping | `H3K27ac`, `timepoint1` | No |
| `replicate_group` | Replicate grouping | `replicate1` | No |
| `condition` | Sample condition | `treatment`, `control` | No |

### Assay-Specific Parameters

#### ChIP-seq Configuration
```yaml
chipseq:
  mapper: "bowtie2"                 # bowtie2 or bwa_mem2
  bowtie2_opts: "--very-sensitive"
  markdup: false                    # Mark duplicates
  macs3_opts: "--qval 0.05 --keep-dup all"
  control_tag: "input"              # Identify control samples
  bw_norm: "CPM"                    # BigWig normalization
```

#### ATAC-seq Configuration
```yaml
atacseq:
  mapper: "bowtie2"
  markdup: false                    # Recommended: false for accessibility
  shift: -75                        # MACS3 shift for ATAC-seq
  extsize: 150                      # MACS3 extension size
  macs3_opts: "--qval 0.05"
```

#### RNA-seq Configuration
```yaml
rnaseq:
  mapper: "star"                    # star or hisat2
  star_opts: "--outFilterMultimapNmax 20"
  strand_specific: false
  gene_bed: "genes.bed"             # For QC analysis
```

### Advanced Features

#### Parameter Sweep
Test multiple peak-calling thresholds:
```yaml
parameter_sweep:
  enabled: true
  qvalues: [0.05, 0.01, 0.005, 0.001]
```

#### Adapter Trimming
Enable quality-based trimming:
```yaml
adapter_trimming:
  enabled: true
  min_length: 20
  quality_cutoff: 20
```

#### Alternative Peak Callers
Enable Genrich for ATAC-seq:
```yaml
## 🧮 Computational Requirements

### Resource Recommendations

| Dataset Size | CPU Cores | Memory | Storage | Time Estimate |
|--------------|-----------|---------|---------|---------------|
| Small (< 10 samples) | 8-16 | 32 GB | 100 GB | 2-6 hours |
| Medium (10-50 samples) | 16-32 | 64 GB | 500 GB | 6-24 hours |
| Large (50+ samples) | 32-64 | 128 GB | 1 TB+ | 1-3 days |

### Cluster Configuration

The pipeline includes SLURM integration via `Executor.sh`. Customize for your cluster:

```bash
#SBATCH --job-name=uSeq2Tracks
#SBATCH --partition=general
#SBATCH --qos=general
#SBATCH --mem=10g
#SBATCH --cpus-per-task=4
```

## 📖 Detailed Workflows

### ChIP-seq/ATAC-seq/CUT&RUN Workflow
1. **Quality Control**: FastQC analysis of raw reads
2. **Adapter Trimming**: Optional quality-based trimming with fastp
3. **Genome Indexing**: Build mapper-specific indices (Bowtie2/BWA-MEM2)
4. **Read Mapping**: Align reads to reference genome
5. **Post-processing**: Sort, index, optional duplicate marking
6. **Coverage Tracks**: Generate normalized BigWig files
7. **Peak Calling**: Identify enriched regions with MACS3
8. **Quality Metrics**: Generate assay-specific QC reports

### RNA-seq Workflow
1. **Quality Control**: FastQC analysis of raw reads
2. **Adapter Trimming**: Optional preprocessing with fastp
3. **Genome Indexing**: Build STAR or HISAT2 indices
4. **Read Mapping**: Splice-aware alignment to reference
5. **Quantification**: Generate gene-level count matrices
6. **Coverage Tracks**: Create strand-specific BigWig files
7. **Quality Assessment**: RNA-seq specific QC with RSeQC

### Long-read Workflow (Nanopore/PacBio)
1. **Quality Assessment**: Basic statistics and length distributions
2. **Genome Indexing**: Build Minimap2 index
3. **Read Mapping**: Long-read aware alignment
4. **Coverage Analysis**: Generate coverage tracks
5. **Variant Calling**: Optional structural variant detection

## 🎨 UCSC Browser Integration

### Automatic Track Hub Generation

uSeq2Tracks automatically creates UCSC-compatible track hubs:

**Hub Structure**:
```
ucsc/
├── hub.txt                 # Hub metadata
├── genomes.txt             # Genome definitions
└── trackDb.txt             # Track configurations
```

**Track Organization**:
- **Composite Tracks**: Group related samples (e.g., same experiment)
- **Subgroups**: Organize by condition, replicate, timepoint
- **Color Coding**: Consistent color schemes per assay type
- **Metadata Integration**: Sample information in track descriptions

### Loading in UCSC Browser

1. Upload track hub files to web-accessible location
2. In UCSC Browser: `My Data` → `Track Hubs` → `My Hubs`
3. Enter hub URL: `https://your-server.com/path/to/ucsc/hub.txt`
4. Browse your data with full metadata integration

## 🔍 Quality Control Features

### Standard QC Reports
- **FastQC**: Per-sample quality metrics
- **MultiQC**: Aggregated quality dashboard
- **Mapping Statistics**: Alignment rates and quality scores
- **Library Complexity**: Duplication rates and insert sizes

### Assay-Specific QC
- **ChIP-seq**: Fragment length distributions, enrichment metrics
- **ATAC-seq**: TSS enrichment, fragment size profiles
- **RNA-seq**: Gene body coverage, junction analysis
- **WGS**: Coverage uniformity, variant quality metrics

### Quality Thresholds
The pipeline includes built-in quality checks:
- Minimum mapping rates
- Fragment count requirements for peak calling
- Insert size validation
- Strand specificity assessment

## 🛠 Troubleshooting

### Common Issues

#### Genome ID Not Set
```
ERROR: genome_id must be set in config.yaml
```
**Solution**: Add `genome_id: "your_genome"` to config.yaml

#### Sample Sheet Formatting
```
ERROR: Missing required columns in sample sheet
```
**Solution**: Ensure sample sheet includes `sample_id`, `type`, and either `sra_id` or local file paths

#### Memory Issues
```
ERROR: Job exceeded memory limit
```
**Solution**: Increase memory allocation in config.yaml:
```yaml
memory:
  large: 128000    # Increase for memory-intensive jobs
```

#### Disk Space
```
ERROR: No space left on device
```
**Solution**: 
- Clean up intermediate files: `snakemake --delete-temp-output`
- Use scratch storage for temporary files
- Monitor disk usage during execution

### Performance Optimization

#### Speed Up Processing
1. **Enable Rapid Mode** for public datasets
2. **Increase Parallelization**: More `--jobs` in Snakemake
3. **Use SSDs** for scratch space
4. **Optimize Resource Allocation**: Match CPU/memory to job requirements

#### Reduce Storage
1. **Delete Intermediate Files**: Use `--delete-temp-output`
2. **Compress Outputs**: Enable compression for BAM files
3. **Archive Unused Data**: Move completed analyses to long-term storage

## 📚 Examples

### Example 1: ENCODE ChIP-seq Analysis (Rapid Mode)
**Scenario**: Processing public ENCODE data for quick visualization

```yaml
# config.yaml
genome_id: "hg38_ENCODE_H3K27ac"
genome: "/data/genomes/hg38.fa"
samplesheet: "encode_chipseq.csv"
rapid_mode: true                     # Skip QC for public data

chipseq:
  mapper: "bowtie2"
  macs3_opts: "--qval 0.01"
```

```csv
# encode_chipseq.csv
sample_id,type,sra_id,read1,read2,experiment_group,replicate_group,condition
ENCSR000EWQ_rep1,chipseq,SRR1536404,,,H3K27ac,ENCSR000EWQ,K562
ENCSR000EWQ_rep2,chipseq,SRR1536405,,,H3K27ac,ENCSR000EWQ,K562
ENCSR000EWQ_input,chipseq,SRR1536406,,,H3K27ac,input,input
```

**Expected Output**:
```
results/hg38_ENCODE_H3K27ac/
├── chipseq/
│   ├── bam/
│   ├── bigwig/
│   └── peaks/
├── ucsc/
│   └── hub.txt
└── rapid/
    └── rapid_tracks_complete.txt
```

**Processing Time**: ~2-3 hours (vs 4-6 hours in standard mode)

### Example 2: Multi-assay Developmental Study (Standard Mode)
**Scenario**: Novel experimental data requiring comprehensive QC

```yaml
# config.yaml
genome_id: "mm10_development"
genome: "/data/genomes/mm10.fa"
gtf: "/data/annotations/mm10.gtf"
samplesheet: "development_study.csv"
rapid_mode: false                    # Full QC for novel data

parameter_sweep:
  enabled: true
  qvalues: [0.1, 0.05, 0.01, 0.001]
  
adapter_trimming:
  enabled: true
  min_length: 20
```

```csv
# development_study.csv
sample_id,type,sra_id,read1,read2,experiment_group,replicate_group,condition
E10_ATAC_rep1,atacseq,,data/E10_ATAC_1_R1.fq.gz,data/E10_ATAC_1_R2.fq.gz,E10,ATAC_rep1,E10
E10_ATAC_rep2,atacseq,,data/E10_ATAC_2_R1.fq.gz,data/E10_ATAC_2_R2.fq.gz,E10,ATAC_rep2,E10
E10_RNA_rep1,rnaseq,,data/E10_RNA_1_R1.fq.gz,data/E10_RNA_1_R2.fq.gz,E10,RNA_rep1,E10
E12_ATAC_rep1,atacseq,,data/E12_ATAC_1_R1.fq.gz,data/E12_ATAC_1_R2.fq.gz,E12,ATAC_rep1,E12
E12_RNA_rep1,rnaseq,,data/E12_RNA_1_R1.fq.gz,data/E12_RNA_1_R2.fq.gz,E12,RNA_rep1,E12
```

**Expected Output**:
```
results/mm10_development/
├── qc/
│   ├── fastqc/
│   └── multiqc_report.html
├── trimmed/
├── atacseq/
├── rnaseq/
├── ucsc/
│   ├── hub.txt
│   └── composite_trackDb.txt
└── genrich_sweep/
```

**Processing Time**: ~8-12 hours with full QC

### Example 3: TCGA Cancer Atlas (Rapid Mode)
**Scenario**: Rapid processing of TCGA RNA-seq data

```yaml
# config.yaml
genome_id: "hg38_TCGA_BRCA"
genome: "/data/genomes/hg38.fa"
gtf: "/data/annotations/gencode.v38.gtf"
samplesheet: "tcga_rnaseq.csv"
rapid_mode: true                     # Fast track generation

rnaseq:
  mapper: "star"
  strand_specific: true
```

```csv
# tcga_rnaseq.csv
sample_id,type,sra_id,read1,read2,experiment_group,replicate_group,condition
TCGA_BRCA_01,rnaseq,SRR8494716,,,BRCA,tumor_01,tumor
TCGA_BRCA_02,rnaseq,SRR8494717,,,BRCA,tumor_02,tumor
TCGA_BRCA_normal,rnaseq,SRR8494718,,,BRCA,normal_01,normal
```

**Benefits**: 
- ⚡ 40% faster processing
- 💾 50% less storage (no QC intermediates)
- 🎯 Clean output for browser visualization

### Example 4: Ancient DNA Analysis (Standard Mode)
**Scenario**: Archaeological samples requiring quality validation

```yaml
# config.yaml
genome_id: "ancientDNA_sample"
genome: "/data/reference/ancient_ref.fa"
samplesheet: "ancient_samples.csv"
rapid_mode: false                    # Need full QC for damage assessment

ancientdna:
  mapper: "bwa_aln"
  markdup: true
  damage_analysis: true
  min_mapq: 30
```

**Why Standard Mode**:
- Ancient DNA has unique quality issues (damage patterns)
- Need comprehensive QC to assess sample preservation
- Publication requires full quality documentation

### Example 5: Mixed-Mode Project
**Scenario**: Combining public and experimental data

```yaml
# config_public.yaml (rapid mode)
genome_id: "mm10_public_controls"
samplesheet: "public_controls.csv"
rapid_mode: true

# config_experimental.yaml (standard mode)
genome_id: "mm10_experimental"
samplesheet: "experimental_samples.csv"
rapid_mode: false
```

**Workflow**:
1. Process public controls rapidly for quick validation
2. Process experimental data with full QC
3. Combine tracks in single UCSC hub
4. Maintain appropriate quality standards for each dataset type

## 🤝 Contributing

Contributions are welcome! To get started:

```bash
git clone https://github.com/pgrady1322/uSeq2Tracks.git
cd uSeq2Tracks
conda env create -f envs/useq2tracks.yml
conda activate useq2tracks
pip install -e ".[dev]"
make test        # run tests
make check       # lint + format check
```

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 📞 Support

- **Documentation**: [GitHub Wiki](https://github.com/pgrady1322/uSeq2Tracks/wiki)
- **Issues**: [GitHub Issues](https://github.com/pgrady1322/uSeq2Tracks/issues)
- **Discussions**: [GitHub Discussions](https://github.com/pgrady1322/uSeq2Tracks/discussions)

## 🏆 Citation

If you use uSeq2Tracks in your research, please cite:

```
Patrick Grady (2026). uSeq2Tracks: A universal pipeline for sequencing
data to genome browser tracks. https://github.com/pgrady1322/uSeq2Tracks
```

## 🙏 Acknowledgments

- **Snakemake Community**: For the excellent workflow management system
- **Bioconda**: For streamlined software distribution
- **UCSC Genome Browser**: For track hub specifications
- **Tool Developers**: FastQC, MultiQC, STAR, Bowtie2, BWA, MACS3, and all other integrated tools

---

**uSeq2Tracks**: From raw sequencing data to publication-ready browser tracks in one streamlined workflow.
