🌌 FlowMeta Tutorial

Title: FlowMeta: Automated End-to-End Metagenomic Profiling Pipeline
Repository: https://github.com/SkinMicrobe/FlowMeta
Author: Dongqiang Zeng
Email: interlaken@smu.edu.cn

🚀 Welcome! This guide walks through installing, configuring, and running FlowMeta, a 10-step metagenomic pipeline that links fastp → Bowtie2 → Kraken2/Bracken → host filtering → multi-format reporting. Use it to reproduce shotgun metagenomics workflows on Linux servers, HPC clusters, or WSL.

🧰 1. Prerequisites

  1. Platform: Linux or WSL (Windows Subsystem for Linux). SSD/NVMe storage keeps Bowtie2 and Kraken2 fast.
  2. Python (≥3.8): Recreate the recommended environment with Conda or mamba:
    conda env create -f environment.yml
    conda activate meta
    # or
    mamba env create -f environment.yml
    mamba activate meta
    Prefer an isolated virtual environment?
    python -m venv .venv
    source .venv/bin/activate  # Windows PowerShell: .venv\Scripts\activate
    pip install -r docs/quickstart-requirements.txt  # tailor as needed
  3. External tools: fastp, bowtie2, samtools, kraken2, bracken, pigz, seqkit. Ensure each command resolves from $PATH.
  4. Reference assets:
    • Bowtie2 index prefix (e.g. /mnt/db/GRCh38_noalt_as/GRCh38_noalt_as).
    • Kraken2 database directory containing hash.k2d, opts.k2d, taxo.k2d, and, for the Bracken steps, the matching k-mer distribution files (database*mers.kmer_distrib).
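
Before a first run it can help to confirm that every external tool from item 3 resolves from $PATH. The helper below is a minimal sketch and not part of FlowMeta; the function name missing_tools is hypothetical.

```shell
# Report which of the given commands cannot be found on $PATH.
# Prints one missing command per line; returns non-zero if any are missing.
missing_tools() {
  local tool missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      missing=1
    fi
  done
  return "$missing"
}

# Usage (tool list taken from the prerequisites above):
#   missing_tools fastp bowtie2 samtools kraken2 bracken pigz seqkit
```

An empty output and a zero exit status mean all listed tools were found.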

📦 2. Installation Options

2.1 🌐 PyPI (preferred)

pip install flowmeta

Ideal once the package is published on PyPI; any user can then install it directly.

2.2 💾 Local wheel

pip install dist/flowmeta-0.1.5-py3-none-any.whl

Use when sharing a pre-built artifact inside secure networks.

2.3 🔧 Build from source

pip install build

python -m build --wheel --sdist
pip install dist/flowmeta-0.1.5-py3-none-any.whl

Confirm README.md, README.zh.md, and docs/tutorial.html stay bundled in the sdist so downstream installs ship with documentation.

Packaging check: tar -tf dist/flowmeta-0.1.5.tar.gz (sdist) and python -m zipfile -l dist/flowmeta-0.1.5-py3-none-any.whl (wheel) quickly reveal missing files. ✅
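
The sdist check can also be scripted. The sketch below lists the archive with tar and reports any expected file that is absent; the function name sdist_missing_files is hypothetical, and it assumes the usual sdist layout in which every member sits under a single top-level "<name>-<version>/" directory.

```shell
# List any of the expected files that are absent from an sdist tarball.
# $1 = path to the .tar.gz; remaining arguments = paths relative to the package root.
sdist_missing_files() {
  local tarball="$1" listing name status=0
  shift
  [ -r "$tarball" ] || return 2
  # Drop the leading "<name>-<version>/" component that sdists put on every member.
  listing="$(tar -tzf "$tarball" | sed 's|^[^/]*/||')"
  for name in "$@"; do
    if ! printf '%s\n' "$listing" | grep -qxF -- "$name"; then
      echo "missing: $name"
      status=1
    fi
  done
  return "$status"
}

# Usage:
#   sdist_missing_files dist/flowmeta-0.1.5.tar.gz README.md README.zh.md docs/tutorial.html
```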

🧬 3. Database Preparation

3.1 🪐 Kraken2 reference libraries

  1. Visit https://benlangmead.github.io/aws-indexes/k2.
  2. Download and extract the desired database (e.g. k2_standard_20240112) to /mnt/db/k2ppf.
  3. Double-check the directory contains hash.k2d, opts.k2d, taxo.k2d, library_report.tsv, and related files.
  4. Optional performance boost: pass --shm_path /dev/shm/k2ppf so FlowMeta caches the DB in RAM.
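
The directory check in step 3 and the RAM-staging precondition in step 4 can be sketched as two small helpers. Both function names are hypothetical, and the free-space test is only a rough guard: it compares the database size with the space currently available under the target path and says nothing about how FlowMeta itself stages the files.

```shell
# Check that a Kraken2 database directory holds the three core index files.
# Prints each absent or empty file; returns non-zero if any is missing.
check_kraken_db() {
  local db_dir="$1" f status=0
  for f in hash.k2d opts.k2d taxo.k2d; do
    if [ ! -s "$db_dir/$f" ]; then
      echo "missing or empty: $db_dir/$f"
      status=1
    fi
  done
  return "$status"
}

# Succeed if the directory $1 (size in KiB) is smaller than the free space under $2.
db_fits_in() {
  local need avail
  need="$(du -sk "$1" | awk '{print $1}')"
  avail="$(df -Pk "$2" | awk 'NR==2 {print $4}')"
  [ "$need" -lt "$avail" ]
}

# Usage:
#   check_kraken_db /mnt/db/k2ppf && db_fits_in /mnt/db/k2ppf /dev/shm
```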

3.2 🧠 Bowtie2 host index (GRCh38 example)

# fetch the genome
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.28_GRCh38.p13/GCA_000001405.28_GRCh38.p13_genomic.fna.gz

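# optional: remove alternate loci and patch scaffolds
# (-n matches against the full header line; by default seqkit grep tests only the sequence ID)
seqkit grep -n -r -v -p "alt|PATCH" GCA_000001405.28_GRCh38.p13_genomic.fna.gz > GRCh38_noalt.fna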

# build Bowtie2 index
mkdir -p /mnt/db/GRCh38_noalt_as
bowtie2-build GRCh38_noalt.fna /mnt/db/GRCh38_noalt_as/GRCh38_noalt_as

# FlowMeta configuration
flowmeta_base ... --db_bowtie2 /mnt/db/GRCh38_noalt_as/GRCh38_noalt_as

🔁 Swap the download URL and the filtering pattern to index other host species.
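
After bowtie2-build finishes, the prefix passed to --db_bowtie2 should resolve to six index files (.1, .2, .3, .4, .rev.1, .rev.2) ending in .bt2, or in .bt2l when Bowtie2 builds a large index. A minimal sketch of that check follows; the function name bowtie2_index_ok is hypothetical.

```shell
# Succeed if all six Bowtie2 index files exist and are non-empty for the given
# prefix, in either the small (.bt2) or the large (.bt2l) flavour.
bowtie2_index_ok() {
  local prefix="$1" ext part ok
  for ext in bt2 bt2l; do
    ok=1
    for part in 1 2 3 4 rev.1 rev.2; do
      [ -s "$prefix.$part.$ext" ] || ok=0
    done
    if [ "$ok" -eq 1 ]; then
      return 0
    fi
  done
  return 1
}

# Usage:
#   bowtie2_index_ok /mnt/db/GRCh38_noalt_as/GRCh38_noalt_as || echo "index incomplete"
```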

🚀 4. Quick Command Overview

4.1 ⚑ Fast start

flowmeta_base \
  --input_dir /mnt/data/flowmeta/01-raw \
  --output_dir /mnt/data/flowmeta/flowmeta-out \
  --db_bowtie2 /mnt/db/GRCh38_noalt_as/GRCh38_noalt_as \
  --db_kraken /mnt/db/k2ppf \
  --threads 32 \
  --project_prefix GLOBAL-

The pipeline auto-detects paired FASTQ files, writes outputs to the specified workspace, and drops .task.complete flags for safe restarts.
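
To preview which samples would be picked up as pairs, the sketch below lists every sample that has both mates in the input directory. It is an illustration rather than FlowMeta's own detection logic: the function name list_paired_samples is hypothetical, and the default suffixes _1.fastq.gz and _2.fastq.gz are an assumption (the guide only states that `_1/_2` suffixes are expected).

```shell
# Print the names of samples that have both mates present in a directory.
# $1 = input directory; $2 and $3 = mate suffixes (assumed defaults below).
list_paired_samples() {
  local dir="$1" s1="${2:-_1.fastq.gz}" s2="${3:-_2.fastq.gz}" r1 sample
  for r1 in "$dir"/*"$s1"; do
    [ -e "$r1" ] || continue          # the glob matched nothing
    sample="$(basename "$r1" "$s1")"
    if [ -e "$dir/$sample$s2" ]; then
      echo "$sample"
    else
      echo "warning: no mate for $sample" >&2
    fi
  done
}

# Usage:
#   list_paired_samples /mnt/data/flowmeta/01-raw
```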

4.2 🛠️ Flag cheat sheet

Flag                     Purpose
--input_dir              Raw FASTQ directory (default: `01-raw`, expecting `_1/_2` suffixes).
--output_dir             Pipeline workspace; creates `02-qc` through `09-mpa`.
--db_bowtie2             Bowtie2 index prefix used for host filtering.
--db_kraken              Kraken2 database directory.
--threads                Total threads per sample.
--batch                  Concurrent samples processed in fastp/Kraken2.
--se                     Toggle single-end mode.
--suffix1, --suffix2     Override FASTQ suffixes when naming schemes differ.
--min_count              Bracken minimum count threshold (host filtering).
--skip_integrity_checks  Skip FASTQ integrity checks to maximize speed (use only on trusted storage).
--check_result           Enable integrity checks in Steps 2 and 4 (off by default to save time).
--step                   Resume from a logical pipeline step (1–10).
--force                  Re-run work from the specified step even if flags exist.
--skip_host_extract      Skip exporting host reads in Step 5.
--no_shm, --shm_path     Control shared-memory staging of Kraken2 databases.

4.3 🎯 Common scenarios

📂 5. Output Layout

02-qc/       fastp reports + trimmed reads
03-hr/       Host-depleted FASTQ files
04-bam/      Bowtie2 BAM and index files
05-host/     Optional host FASTQ exports
06-ku/       Kraken2 reports (first pass)
07-bracken/  Bracken abundance tables
08-ku2/      Host-filtered rerun outputs
09-mpa/      Final merged OTU / MPA matrices

🧩 Each directory stores per-sample .task.complete files so you can resume safely after interruptions.
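
To see what a restart would still have to process, one can compare a sample list against the markers in a step directory. The sketch below assumes markers are named "<sample>.task.complete"; the guide does not spell out the exact naming, so treat that as a guess and adjust it to what your workspace actually contains. The function name pending_samples is hypothetical.

```shell
# Read sample names on stdin and print those with no completion marker in $1.
# Assumes markers are named "<sample>.task.complete" (not confirmed by the guide).
pending_samples() {
  local step_dir="$1" sample
  while IFS= read -r sample; do
    [ -n "$sample" ] || continue      # skip blank lines
    [ -e "$step_dir/$sample.task.complete" ] || echo "$sample"
  done
}

# Usage:
#   printf 'S1\nS2\n' | pending_samples /mnt/data/flowmeta/flowmeta-out/02-qc
```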

🧭 6. Step-by-Step Logic

ℹ️ Integrity checks in Steps 2 and 4 run only when you supply --check_result; they are off by default to save time. Passing --skip_integrity_checks skips all FASTQ integrity checks to maximize throughput, so use it only when the storage is trusted. At startup the CLI also prints a concise path overview for all step directories.

  1. fastp quality control.
  2. fastp integrity verification (requires --check_result).
  3. Bowtie2 host filtering (FASTQ + BAM outputs).
  4. Host-removed FASTQ validation (requires --check_result).
  5. Optional host read export (samtools fastq).
  6. Kraken2 database staging (shared memory).
  7. Kraken2 + Bracken classification.
  8. Kraken report validation.
  9. Host-taxid filtering and Bracken rerun.
  10. Final OTU/MPA/Bracken matrix merges.

🔄 Steps map to functions in flowmeta/steps/ for advanced customization, and a run can be resumed from any step with --step.
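
Steps 2 and 4 validate FASTQ files, but this guide does not describe how FlowMeta performs that validation. As a rough manual stand-in, gzip -t detects truncated or corrupted compressed files; it does not validate the FASTQ records themselves. The function name find_corrupt_fastq and the *.fastq.gz pattern are assumptions made for this sketch.

```shell
# Print the path of every *.fastq.gz in $1 that fails gzip's integrity test.
# Returns non-zero if at least one file is corrupt.
find_corrupt_fastq() {
  local dir="$1" f status=0
  for f in "$dir"/*.fastq.gz; do
    [ -e "$f" ] || continue           # the glob matched nothing
    if ! gzip -t "$f" 2>/dev/null; then
      echo "$f"
      status=1
    fi
  done
  return "$status"
}

# Usage:
#   find_corrupt_fastq /mnt/data/flowmeta/flowmeta-out/03-hr
```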

Step  Purpose                                                   Files counted when announced
1     fastp trimming/QC.                                        FASTQ pairs in 01-raw matching suffix1.
2     fastp integrity verification (requires --check_result).   .task.complete or _fastp.json in 02-qc.
3     Bowtie2 host depletion + BAM creation.                    .task.complete in 02-qc.
4     Host-removed FASTQ validation (requires --check_result).  _host_remove_R1.fastq.gz in 03-hr.
5     Optional samtools host-read export.                       .bam files in 04-bam.
6     Stage Kraken2 DB in shared memory (if enabled).           N/A
7     Kraken2/Bracken classification.                           _host_remove_R1.fastq.gz in 03-hr.
8     Kraken report validation.                                 .kraken.report.std.txt in 06-ku.
9     Host-taxid filtering + Bracken rerun.                     .kraken.report.std.txt in 06-ku.
10    Merge OTU/MPA/Bracken outputs.                            .nohuman.kraken.mpa.std.txt (08-ku2) + .bracken tables (07-bracken).
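
The counts in the last column can be reproduced by hand when a run announces fewer files than expected. The sketch below counts files by suffix in a workspace, using the directory names and suffixes from the table; both function names are hypothetical, and this is not the code FlowMeta uses internally.

```shell
# Count the files in a directory whose names end with a given suffix.
count_by_suffix() {
  local dir="$1" suffix="$2" f n=0
  for f in "$dir"/*"$suffix"; do
    if [ -e "$f" ]; then
      n=$((n + 1))
    fi
  done
  echo "$n"
}

# Print the counts the table associates with steps 4/7, 5, and 8/9.
# $1 = pipeline workspace (the --output_dir value).
summarize_workspace() {
  local ws="$1"
  echo "03-hr host-removed R1 FASTQ: $(count_by_suffix "$ws/03-hr" _host_remove_R1.fastq.gz)"
  echo "04-bam BAM files: $(count_by_suffix "$ws/04-bam" .bam)"
  echo "06-ku Kraken reports: $(count_by_suffix "$ws/06-ku" .kraken.report.std.txt)"
}

# Usage:
#   summarize_workspace /mnt/data/flowmeta/flowmeta-out
```

Matching on the full suffix keeps index files such as .bam.bai out of the BAM count.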

πŸ›‘οΈ 7. Logging & Troubleshooting

❓ 8. FAQ

  1. How do I rebuild after a database update?
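    Re-run from the first affected step: use --step 3 --force after a Bowtie2 index change, or a later step (for example --step 6 when only the Kraken2 database changed), depending on the scope.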
  2. Can I add new samples incrementally?
    Drop FASTQ files into 01-raw; FlowMeta processes only the samples lacking .task.complete markers.
  3. Where are the fastp QC reports?
    Check 02-qc for both HTML reports and logs; ensure disk space is sufficient.
  4. How do I see the CLI help?
    flowmeta_base -h prints the entire argument list.

🀝 9. Support & Citation

Maintainer: Dongqiang Zeng · Southern Medical University · interlaken@smu.edu.cn

Please cite the GitHub repository if FlowMeta contributes to your research: https://github.com/SkinMicrobe/FlowMeta.

🎉 Happy sequencing!