Metadata-Version: 2.4
Name: geo-upload-tool
Version: 2.2.3
Summary: CLI tool for preparing data submission to Gene Expression Omnibus
Project-URL: Homepage, https://github.com/BU-Neuromics/gut
Project-URL: Repository, https://github.com/BU-Neuromics/gut
Author-email: Adam Labadorf <labadorf@bu.edu>
License: MIT
Keywords: NGS,sequencing
Classifier: Development Status :: 3 - Alpha
Classifier: Environment :: Console
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3.12
Requires-Dist: docopt>=0.6.2
Requires-Dist: openpyxl>=3.1.0
Requires-Dist: pandas>=2.0.0
Requires-Dist: python-dotenv>=1.0.0
Requires-Dist: pyyaml>=6.0.0
Requires-Dist: snakemake>=8.0.0
Provides-Extra: bio
Requires-Dist: pysam>=0.22.0; extra == 'bio'
Provides-Extra: cluster
Requires-Dist: snakemake-executor-plugin-cluster-generic>=1.0.0; extra == 'cluster'
Requires-Dist: snakemake-executor-plugin-slurm>=0.4.0; extra == 'cluster'
Provides-Extra: dev
Requires-Dist: black>=26.3.1; extra == 'dev'
Requires-Dist: mypy>=1.8.0; extra == 'dev'
Requires-Dist: pre-commit>=3.6.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.1.0; extra == 'dev'
Requires-Dist: pytest>=8.0.0; extra == 'dev'
Requires-Dist: ruff>=0.2.0; extra == 'dev'
Description-Content-Type: text/markdown

# gut - gEO upload tool

Last updated: 2026-03-20

GEO is amazing. Uploading data to GEO is not. I wrote this tool to ease the
pain of preparing all of the files and metadata associated with uploading a
dataset of high throughput sequencing data to GEO. The tool makes the process
much less manual, tedious, and error prone, as it requires well structured
tabular input that can be checked automatically for common problems.

## Installation & Prerequisites

**Requirements:**
- Python 3.12 or higher (breaking change from previous versions which supported Python 3.4+)
- [STAR aligner](https://github.com/alexdobin/STAR) for paired-end insert size calculation
- Recommended: [conda](https://docs.conda.io/) for installing bioinformatics tools (STAR, samtools)

Install using pip:

```bash
pip install geo-upload-tool
```

**Note:** For bioinformatics dependencies like STAR and samtools, we recommend using conda as these tools have system-level dependencies that are easier to manage through conda:

```bash
conda install -c conda-forge -c bioconda STAR samtools
```

## Getting Started

The quickest way to get started is to use the `gut init` command to create a new project directory with template files:

```bash
gut init my_geo_submission
cd my_geo_submission
```

This creates a directory with five template files:
- `sample_info.csv` - Template for sample metadata
- `file_info.csv` - Template for file paths and metadata
- `other_sections.csv` - Template for study, protocol, and data processing metadata
- `.env.template` - Template for FTP credentials
- `README.md` - Quick start guide

Edit the CSV files with your actual data, following the inline comments and examples. Then proceed with the workflow below.

## Basic Usage

The entire process is driven by two CSV files, *sample_info* and *file_info*,
described below.

### Sample Info

This information makes up the SAMPLES section. The CSV should have exactly one
row per sample and the following columns, all required unless marked optional:

  - **Sample name**: unique name for this sample
  - **source name**: sample source, e.g. brain
  - **organism**: name of the organism, e.g. human
  - **molecule**: one of a set of controlled vocabulary, listed below
  - **description** (optional): description of the sample, if desired

You may include as many more columns in the file as you like, and they will
all be added as **characteristic: tag** columns under the SAMPLES section.

NB: The **Sample name** column is used to cross reference with the file info,
which is discussed next.
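
The column rules above can be checked mechanically before running gut. The following is a minimal sketch, not gut's own validation code: the function names `check_sample_info` and `characteristic_columns` are hypothetical, and it assumes column names are matched exactly as written above.

```python
import csv
import io

# Required columns and molecule vocabulary as listed in this README.
REQUIRED_COLUMNS = ["Sample name", "source name", "organism", "molecule"]
MOLECULES = {
    "total RNA", "polyA RNA", "cytoplasmic RNA", "nuclear RNA",
    "genomic DNA", "protein", "other",
}


def check_sample_info(handle):
    """Return a list of human-readable problems found in a sample info CSV."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing required column(s): {', '.join(missing)}"]
    problems = []
    seen = set()
    for lineno, row in enumerate(reader, start=2):  # line 1 is the header
        name = (row["Sample name"] or "").strip()
        if name in seen:
            problems.append(f"line {lineno}: duplicate sample name {name!r}")
        seen.add(name)
        molecule = (row["molecule"] or "").strip()
        if molecule not in MOLECULES:
            problems.append(f"line {lineno}: unknown molecule {molecule!r}")
    return problems


def characteristic_columns(fieldnames):
    """Columns beyond the known ones; these become characteristic tags."""
    known = set(REQUIRED_COLUMNS) | {"description"}
    return [c for c in fieldnames if c not in known]


example = io.StringIO(
    "Sample name,source name,organism,molecule,tissue\n"
    "A1,brain,human,total RNA,cortex\n"
    "A1,brain,human,mRNA,cortex\n"
)
for problem in check_sample_info(example):
    print(problem)
```

Here the second data row is reported twice: once for reusing the name `A1` and once because `mRNA` is not in the controlled vocabulary.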

### File Info

This information is used to derive the RAW FILES, PROCESSED DATA FILES, and
PAIRED-END EXPERIMENTS sections, as well as the **processed data file** and
**raw file** columns of the SAMPLES section. gut infers whether a file is
raw or processed based on the **rectype** column (see below). The CSV should
have at least one raw and at least one processed file per sample in the
sample info file (GEO requires this).

The raw files are always fastq files, and there should be one row
*per fastq* file per sample, e.g.:

```
Sample name,rectype,file type,instrument model,path,ref_fa,end
A1,PE fastq,fastq,Illumina HiSeq 2000,A1_R1.fastq.gz,hg38.fa,1
A1,PE fastq,fastq,Illumina HiSeq 2000,A1_R2.fastq.gz,hg38.fa,2
A2,SE fastq,fastq,Illumina HiSeq 2000,A2_R1.fastq.gz,hg38.fa,1
```

The processed files may be any other type of file, and there must be at
least one processed file for *each sample* in the sample info file, e.g.
continuing from above example:

```
A1,wig,na,na,A1.wig.gz,na,na
A1,csv,na,na,raw_counts.csv,na,na
A2,wig,na,na,A2.wig.gz,na,na
A2,csv,na,na,raw_counts.csv,na,na
```

Note the same file *raw_counts.csv* is provided for both samples, since
the raw counts matrix often contains processed data for all samples.

The CSV file must have column names in the first row and all of the
following columns, except those marked optional:

  - **Sample name**: unique name for this sample, corresponds to sample info
  - **rectype**:
      - for RAW files: either "PE fastq" or "SE fastq"
      - for PROCESSED files: anything appropriate for the file (e.g. csv,
        txt, wig, etc)
  - **file type**:
      - for RAW files: one of a controlled vocabulary, listed below
      - for PROCESSED files: value ignored
  - **instrument model**:
      - for RAW files: required; one of a controlled vocabulary, listed below
      - for PROCESSED files: value ignored
  - **path**: the relative or absolute path to the file on your local system
  - **ref_fa**: (optional)
      - for RAW files only: a local path or URL to a fasta reference sequence
        that can be used to compute average insert size and standard deviation
        for paired-end datasets
      - Supports local paths and URLs (http://, https://, ftp://)
      - URLs are automatically downloaded and cached in `.cache/references/`
      - Example local path: `/path/to/reference.fa`
      - Example URL: `https://ftp.ensembl.org/pub/.../genome.fa.gz`
  - **end**: required only for rectype == "PE fastq": either 1 or 2, indicating
    which read of the pair the fastq file contains
  - **alias**: (optional) a clean filename to use in the GEO submission directory
    instead of the original on-disk filename. When provided and non-empty, the
    symlink (or copy) created in the output directory will use the alias name, and
    the alias will appear as the file name in GEO metadata. This is useful when
    raw pipeline filenames contain run-specific tokens that are irrelevant to the
    submission.

    Example: if the file on disk is `DH-WT_IBH1-1_UNUSEFUL_INFO_R1_concat.fa.gz`,
    you can set `alias` to `IBH1-1_R1_concat.fa.gz` and it will be uploaded under
    that name. Alias values must be unique within the submission.

Any additional columns in the file info file are simply ignored.
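
Two of the rules above are easy to get wrong by hand: the **end** value for paired-end files and the uniqueness of **alias** values. The sketch below illustrates both checks. It is not gut's implementation; `check_file_info` is a hypothetical name, and it assumes that listing the same path twice with the same alias (as with a shared counts matrix) is acceptable while one alias pointing at two different paths is not.

```python
import csv
import io


def check_file_info(rows):
    """Return problems with 'end' and 'alias' values in file info rows."""
    problems = []
    alias_paths = {}  # alias -> set of distinct paths using it
    for row in rows:
        sample, rectype = row["Sample name"], row["rectype"]
        if rectype == "PE fastq" and row.get("end") not in ("1", "2"):
            problems.append(f"{sample}: PE fastq {row['path']} needs end 1 or 2")
        alias = (row.get("alias") or "").strip()
        if alias:
            alias_paths.setdefault(alias, set()).add(row["path"])
    for alias, paths in sorted(alias_paths.items()):
        if len(paths) > 1:
            problems.append(
                f"alias {alias!r} is used for {len(paths)} different files"
            )
    return problems


example = io.StringIO(
    "Sample name,rectype,file type,instrument model,path,ref_fa,end,alias\n"
    "A1,PE fastq,fastq,Illumina HiSeq 2000,A1_R1.fastq.gz,hg38.fa,1,\n"
    "A1,PE fastq,fastq,Illumina HiSeq 2000,A1_R2.fastq.gz,hg38.fa,,\n"
    "A1,csv,na,na,raw_counts.csv,na,na,counts.csv\n"
    "A2,csv,na,na,other_counts.csv,na,na,counts.csv\n"
)
for problem in check_file_info(list(csv.DictReader(example))):
    print(problem)
```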

### Validate

With the above CSVs prepared, you can validate them, to make sure everything
lines up as expected:

```
gut validate -o my_geo_submission sample_info.csv file_info.csv
```

The `-o` argument is the name of the directory that will be created to stage
the files (GEO requires the directory be named the same as your email). The
validation logic checks that your samples and files line up, for example
that every sample appears in both files and that each sample has both raw
and processed files.
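
To make the cross-referencing concrete, here is a small sketch of the kind of check described above. It is an illustration only, not the logic `gut validate` actually runs, and `cross_check` is a hypothetical name.

```python
# rectype values that mark a file as raw, per the "rectype" section below.
RAW_RECTYPES = {"PE fastq", "SE fastq"}


def cross_check(sample_names, file_rows):
    """Compare sample names against file info rows and list mismatches."""
    raw, processed = set(), set()
    for row in file_rows:
        target = raw if row["rectype"] in RAW_RECTYPES else processed
        target.add(row["Sample name"])
    samples = set(sample_names)
    problems = []
    for name in sorted(samples - raw):
        problems.append(f"{name}: no raw file")
    for name in sorted(samples - processed):
        problems.append(f"{name}: no processed file")
    for name in sorted((raw | processed) - samples):
        problems.append(f"{name}: listed in file info but not in sample info")
    return problems


rows = [
    {"Sample name": "A1", "rectype": "PE fastq"},
    {"Sample name": "A1", "rectype": "wig"},
    {"Sample name": "A2", "rectype": "SE fastq"},
    {"Sample name": "A3", "rectype": "csv"},
]
for problem in cross_check(["A1", "A2"], rows):
    print(problem)
```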


### Build

Once you have fixed all the problems and validation is successful, you can
build the staged directory:

```
gut build -o my_geo_submission sample_info.csv file_info.csv
```

This will do the following:

  1. Symlink (or copy with `--copy`) all of the raw and processed files into
     the staging directory
  2. Construct a metadata file with SAMPLES, PROCESSED DATA FILES, RAW FILES,
     and PAIRED-END EXPERIMENTS sections filled out appropriately, saved as an
     Excel file in the staging directory
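
The staging step in item 1 amounts to linking or copying each file under its final name. A rough sketch of that idea follows; `stage_file` is a hypothetical helper rather than a gut function, and the choice to overwrite an existing target is an assumption.

```python
import shutil
from pathlib import Path


def stage_file(source, staging_dir, alias=None, copy=False):
    """Symlink (or copy) source into staging_dir, optionally under an alias."""
    source = Path(source).resolve()
    staging = Path(staging_dir)
    staging.mkdir(parents=True, exist_ok=True)
    target = staging / (alias or source.name)
    # Assumption: a previously staged file of the same name is replaced.
    if target.is_symlink() or target.exists():
        target.unlink()
    if copy:
        shutil.copy2(source, target)
    else:
        target.symlink_to(source)
    return target
```

Symlinking keeps the staging directory small, while copying is the safer choice if the original files may move before upload.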

The following optional flags enable additional processing steps via Snakemake:

  - `--md5` — Compute MD5 checksums on all files
  - `--readlen` — Identify read length and single- or paired-endedness for
    fastq files
  - `--inner-mate-distance` — Calculate average insert size and standard
    deviation for paired-end fastq files using STAR (requires `--ref-fa=FA`
    or **ref_fa** columns in file_info.csv)
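
For a sense of what the first two flags compute, the sketch below shows one way to obtain an MD5 checksum and the read lengths of a fastq file using only the standard library. It is not the Snakemake workflow gut runs; `md5sum` and `read_lengths` are hypothetical names, and the fastq parser assumes strict four-line records.

```python
import gzip
import hashlib


def md5sum(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file without loading it all at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def read_lengths(path, max_reads=1000):
    """Return the set of read lengths among the first max_reads records."""
    opener = gzip.open if str(path).endswith(".gz") else open
    lengths = set()
    with opener(path, "rt") as handle:
        for lineno, line in enumerate(handle):
            if lineno // 4 >= max_reads:
                break
            if lineno % 4 == 1:  # second line of each record is the sequence
                lengths.add(len(line.rstrip("\n")))
    return lengths
```

A set with more than one value suggests trimmed reads, in which case a single read length may not describe the file well.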

If all went well, the file *metadata_TOFILL.xlsx* should exist inside the
staging directory. As the *TOFILL* part suggests, you need to fill it out
some more, as the other sections (e.g. SERIES) are not yet complete, unless
you provided the other sections with the `--addnl` CLI flag (see below). I
suggest you create a copy named *metadata_complete.xlsx* or something in the
staging directory and fill that out. Be on the lookout for errors and blank
fields; I surely didn't think to check for every possible mistake.

### Other Sections (SERIES, PROTOCOLS, DATA PROCESSING)

You may also provide a CSV file with the SERIES, PROTOCOLS, and DATA
PROCESSING PIPELINE sections to automate the remaining metadata. The `gut
init` command creates an `other_sections.csv` template with all supported
fields. Fill it in and provide it to gut with the `--addnl` CLI option:

```
gut build -o my_geo_submission sample_info.csv file_info.csv --addnl other_sections.csv
```

The resulting `metadata_TOFILL.xlsx` will have these fields incorporated,
and if you were thorough, you might not need to edit it at all. As described
in the Upload section below, gut skips files with TOFILL in the name when
uploading, so you will still have to copy or rename the metadata file (e.g.
to `metadata.xlsx`) for gut to upload it.

### Using Reference Genomes from URLs

gut now supports downloading reference FASTA files directly from URLs, eliminating
the need to manually download large genome files. This is particularly useful for
paired-end insert size calculation.

**Supported URL types:**
- HTTP: `http://example.com/genome.fa`
- HTTPS: `https://ftp.ensembl.org/pub/.../genome.fa.gz`
- FTP: `ftp://ftp.ncbi.nlm.nih.gov/.../genome.fa`

**How it works:**
1. URLs are automatically detected when you provide them via `--ref-fa` flag or in the `ref_fa` column
2. Files are downloaded once and cached in `.cache/references/` directory
3. Subsequent builds reuse the cached file (no re-download)
4. Both local paths and URLs work interchangeably

**Example with CLI flag:**
```bash
gut build -o my_geo_submission \
  --ref-fa=https://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz \
  sample_info.csv file_info.csv
```

**Example with per-sample URLs in file_info.csv:**
```csv
Sample name,rectype,file type,instrument model,path,ref_fa,end
A1,PE fastq,fastq,Illumina HiSeq 2000,A1_R1.fq.gz,https://example.com/hg38.fa.gz,1
A1,PE fastq,fastq,Illumina HiSeq 2000,A1_R2.fq.gz,https://example.com/hg38.fa.gz,2
A2,PE fastq,fastq,Illumina HiSeq 2000,A2_R1.fq.gz,ftp://ftp.ensembl.org/.../mm10.fa,1
A2,PE fastq,fastq,Illumina HiSeq 2000,A2_R2.fq.gz,ftp://ftp.ensembl.org/.../mm10.fa,2
```

**Cache management:**
- Cached files are stored in `.cache/references/` within your output directory
- Each cached file has a `.url` sidecar file storing the original URL
- To clean cache, simply delete files from `.cache/references/`
- Cache is verified before reuse (URL must match)
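
The sidecar check can be pictured with the following sketch. It is an illustration under stated assumptions, not gut's code: the function names are hypothetical, the cache file naming scheme (a hash prefix plus the original basename) is invented for the example, and the actual download is left out.

```python
import hashlib
from pathlib import Path
from urllib.parse import urlparse

URL_SCHEMES = {"http", "https", "ftp"}


def is_url(ref_fa):
    """True if ref_fa looks like one of the supported URL types."""
    return urlparse(ref_fa).scheme.lower() in URL_SCHEMES


def cache_path(url, cache_dir):
    """Hypothetical naming scheme: hash prefix plus the URL's basename."""
    name = Path(urlparse(url).path).name or "reference"
    prefix = hashlib.sha256(url.encode()).hexdigest()[:12]
    return Path(cache_dir) / f"{prefix}_{name}"


def record_download(url, cache_dir, data):
    """Store downloaded bytes along with a .url sidecar naming their origin."""
    target = cache_path(url, cache_dir)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(data)
    target.with_name(target.name + ".url").write_text(url + "\n")
    return target


def cached_copy(url, cache_dir):
    """Return the cached file for url, or None if absent or mismatched."""
    target = cache_path(url, cache_dir)
    sidecar = target.with_name(target.name + ".url")
    if target.is_file() and sidecar.is_file():
        if sidecar.read_text().strip() == url:
            return target
    return None
```

Recording the URL next to the file lets a later run confirm that a cached file really came from the URL being requested before reusing it.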

**Troubleshooting:**
- **Network timeout**: Large files (2-3GB) may take 15-30 minutes to download
- **404 error**: Verify the URL is correct and accessible in your browser
- **Firewall issues**: HTTP/HTTPS URLs are generally more reliable than FTP
- **Manual fallback**: You can always download the file manually and provide a local path

### Upload

Once you have filled in the missing metadata and put the new file into the
staging directory, you are ready to upload. You will have to initiate the
upload process from the GEO website and receive an upload directory, e.g.
`uploads/your@email.edu_mXoLeWqE`. Python ships with an FTP client, and
gut uses it to upload just the staged files.

#### Credential Management

For security, gut no longer accepts passwords as command-line arguments. Instead,
credentials can be provided through multiple secure methods:

**Method 1: Environment Variables (Recommended for CI/CD)**

```bash
export GEO_FTP_USER=geousername
export GEO_FTP_PASSWORD=geopassword
gut upload my_geo_submission uploads/your@email.edu_mXoLeWqE
```

**Method 2: .env File (Recommended for Local Development)**

Create a `.env` file in your project directory:

```bash
# .env file
GEO_FTP_USER=geousername
GEO_FTP_PASSWORD=geopassword
```

Make sure to add `.env` to your `.gitignore` to avoid committing credentials!

Then run the upload command:

```bash
gut upload my_geo_submission uploads/your@email.edu_mXoLeWqE
```

**Method 3: Command Line Flag + Interactive Prompt (Most Secure)**

```bash
gut upload --user geousername my_geo_submission uploads/your@email.edu_mXoLeWqE
# Password: [you will be prompted securely]
```

**Method 4: Fully Interactive**

```bash
gut upload my_geo_submission uploads/your@email.edu_mXoLeWqE
# FTP Username: [enter username]
# FTP Password: [enter password securely]
```

You can get the geousername and geopassword from the GEO website upon
initiating an upload. Your submission should be done in a matter of hours to
days, depending on how big your data are. Then the iteration begins.

**Security Notes:**
- Passwords are never displayed in logs or terminal output
- Interactive prompts use hidden input (getpass)
- Keep `.env` files out of version control (add `.env` to your `.gitignore`)
- Avoid including credentials in shell scripts that may be shared
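
The four methods above can be thought of as a fallback chain. The sketch below shows one plausible ordering; the precedence (command-line user, then environment, then prompt) is an assumption, and `resolve_credentials` is a hypothetical name rather than part of gut. Values from a `.env` file would reach this function through the environment once loaded, for example with python-dotenv's `load_dotenv()`.

```python
import getpass
import os


def resolve_credentials(user=None, env=None, ask_user=input,
                        ask_password=getpass.getpass):
    """Pick up FTP credentials from arguments, environment, or prompts."""
    env = os.environ if env is None else env
    # Assumed precedence: explicit --user, then GEO_FTP_USER, then a prompt.
    user = user or env.get("GEO_FTP_USER") or ask_user("FTP Username: ")
    # The password never comes from the command line, only env or prompt.
    password = env.get("GEO_FTP_PASSWORD") or ask_password("FTP Password: ")
    return user, password
```

Passing the prompt functions as parameters keeps the sketch easy to exercise without a terminal.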

NB: gut will upload *everything* in the staging directory *except*:

  - files with **TOFILL** in the name
  - the .cache directory, which contains a bunch of stuff gut made for
    processing the files

You can put other things in there you want to upload if you so desire.

Sometimes upload can fail part way through, especially when uploading many
large files. To avoid unnecessary re-uploads, the upload routine checks
for the presence of each file on the server before uploading, and if the
remote and local file sizes are the same, upload is skipped. You can turn
this behavior off and force upload every time by supplying `--no-cache` to the
upload command.
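
The skip-if-same-size behavior and the exclusion rules can be sketched with `ftplib` as follows. This is an approximation for illustration, not gut's upload routine: the function names are hypothetical, and it assumes files are uploaded flat by filename into the current remote directory.

```python
import ftplib
from pathlib import Path


def files_to_upload(staging_dir):
    """Yield staged files, skipping TOFILL files and the .cache directory."""
    staging = Path(staging_dir)
    for path in sorted(staging.rglob("*")):
        rel = path.relative_to(staging)
        if not path.is_file():
            continue
        if ".cache" in rel.parts or "TOFILL" in path.name:
            continue
        yield path


def needs_upload(ftp, path, no_cache=False):
    """True unless the server already has a file of the same size."""
    if no_cache:
        return True
    try:
        # Some servers only answer SIZE in binary mode (TYPE I).
        remote_size = ftp.size(path.name)
    except ftplib.error_perm:
        return True  # usually means the file is not on the server yet
    return remote_size != path.stat().st_size


def upload(ftp, staging_dir, no_cache=False):
    """Upload what is needed and return the names of the files sent."""
    sent = []
    for path in files_to_upload(staging_dir):
        if needs_upload(ftp, path, no_cache):
            with open(path, "rb") as handle:
                ftp.storbinary(f"STOR {path.name}", handle)
            sent.append(path.name)
    return sent
```

Comparing sizes is cheap but cannot detect a file whose content changed without changing length, which is one reason to reach for `--no-cache` after editing a staged file.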

## Development

### Setup

This project uses [uv](https://github.com/astral-sh/uv) for dependency management and packaging.

**Set up development environment:**

```bash
# Clone the repository
git clone https://github.com/BU-Neuromics/gut
cd gut

# Install all dependencies (dev, bio, cluster extras)
uv sync --all-extras
```

### Development Workflow

**Running tests:**

```bash
uv run pytest
```

**Code quality checks:**

```bash
# Format code with Black
uv run black .

# Lint with Ruff
uv run ruff check .
```

**Pre-commit hooks:**

This project uses pre-commit hooks to ensure code quality. Install them with:

```bash
uv run pre-commit install
```

The hooks will automatically run on staged files before each commit. To bypass the hooks (not recommended), use:

```bash
git commit --no-verify
```

### Build & Release

**Build distribution packages:**

```bash
uv build
```

This creates both wheel (`.whl`) and source distribution (`.tar.gz`) in the `dist/` directory.

Releases are published to [PyPI](https://pypi.org/project/geo-upload-tool/) automatically via GitHub Actions when a version tag (e.g. `v2.2.2`) is pushed.

## Detailed Documentation

TODO

## Controlled Field Values

### molecule

If the GEO sequencing template is to be believed, **molecule** must be precisely
one of:

  - total RNA
  - polyA RNA
  - cytoplasmic RNA
  - nuclear RNA
  - genomic DNA
  - protein
  - other

### rectype

These values are gut-specific, and used to help figure out what to do with
the files. The files that end up in the RAW FILES section are:

  - PE fastq
  - SE fastq

Anything else ends up in the PROCESSED DATA FILES section (e.g. csv, txt,
peak, wig, bed, gff, etc).

### file type

These are the accepted **file type** values:

  - fastq
  - Illumina_native_qseq
  - Illumina_native
  - SOLiD_native_csfasta
  - SOLiD_native_qual
  - sff
  - 454_native_seq
  - 454_native_qual
  - Helicos_native
  - srf
  - PacBio_HDF5

### instrument model

According to the GEO sequencing template, **instrument model** must be one of:

  - Illumina Genome Analyzer
  - Illumina Genome Analyzer II
  - Illumina Genome Analyzer IIx
  - Illumina HiSeq 2000
  - Illumina HiSeq 1000
  - Illumina MiSeq
  - Illumina NextSeq

  - AB SOLiD System
  - AB SOLiD System 2.0
  - AB SOLiD System 3.0
  - AB SOLiD 4 System
  - AB SOLiD 4hq System
  - AB SOLiD PI System
  - AB 5500xl Genetic Analyzer
  - AB 5500 Genetic Analyzer

  - 454 GS
  - 454 GS 20
  - 454 GS FLX
  - 454 GS Junior
  - 454 GS FLX Titanium

  - Helicos HeliScope
  - PacBio RS
  - PacBio RS II
  - Complete Genomics
  - Ion Torrent PGM
