Metadata-Version: 2.1
Name: oligoseeker
Version: 0.0.2
Summary: A Python tool for processing paired FASTQ files to efficiently count oligo codons.
Home-page: https://github.com/mtinti/OligoSeeker
Author: mtinti
Author-email: michele.tinti@gmail.com
License: Apache Software License 2.0
Keywords: nbdev jupyter notebook python
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: License :: OSI Approved :: Apache Software License
Requires-Python: >=3.7
Description-Content-Type: text/markdown
Provides-Extra: dev
License-File: LICENSE

OligoSeeker
================

<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->

## Installation

You can install the package via pip:

``` bash
pip install OligoSeeker
```

Or directly from the repository:

``` bash
pip install git+https://github.com/mtinti/OligoSeeker.git
```

## Overview

OligoSeeker is a Python library designed to process paired FASTQ files
and count occurrences of specific oligo codons. It provides a simple yet
powerful interface for bioinformatics researchers working with
oligonucleotide analysis.

## Features

- Process paired FASTQ files (gzipped or uncompressed)
- Search for custom oligo sequences with codon sites (NNN)
- Support for both forward and reverse complement matching
- Comprehensive results in CSV and Excel formats
- Progress reporting for long-running operations
- User-friendly command-line interface
- Modular design for integration with other tools

## How It Works

OligoSeeker searches for specific oligonucleotide patterns in paired
FASTQ reads. When it finds a match, it extracts the codon sequence
(represented by NNN in the oligo pattern) and tallies its occurrence.
The library handles both forward and reverse complement matching,
ensuring comprehensive detection.

The basic workflow is:

1.  Load and validate oligo sequences
2.  Process paired FASTQ files
3.  Count codon occurrences for each oligo
4.  Output results in CSV format
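
The matching step can be pictured with a short sketch. This is not OligoSeeker's own implementation: `reverse_complement`, `oligo_to_regex`, and `count_codons` are hypothetical helpers, and the sketch assumes that the `NNN` site becomes a three-base capture group, that each read is also searched as its reverse complement, and that a read pair without a match is tallied under `none` (mirroring the `none` row in the example output).

``` python
import re
from collections import Counter

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def oligo_to_regex(oligo):
    """Turn the NNN site of an oligo into a capture group of three bases."""
    return re.compile(oligo.upper().replace("NNN", "([ACGT]{3})"))


def count_codons(oligo, read_pairs):
    """Tally the codon captured for one oligo across paired reads (hypothetical helper)."""
    pattern = oligo_to_regex(oligo)
    counts = Counter()
    for read_1, read_2 in read_pairs:
        # Search both reads in forward and reverse-complement orientation.
        candidates = (read_1, reverse_complement(read_1),
                      read_2, reverse_complement(read_2))
        matches = (pattern.search(candidate) for candidate in candidates)
        hit = next((m for m in matches if m), None)
        counts[hit.group(1) if hit else "none"] += 1
    return counts


pairs = [
    ("TTGAACACTCATGG", "CCCCCCCC"),  # forward match in read 1
    ("AAAAAAAA", "CCATGAGTGTTCAA"),  # reverse-complement match in read 2
    ("AAAAAAAA", "CCCCCCCC"),        # no match in either read
]
codon_counts = count_codons("GAACNNNCAT", pairs)
```

Here the first two pairs both yield the codon `ACT`, and the third pair is counted under `none`.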

## Quick Start

### Command-Line Usage

``` bash
# Basic usage with a comma-separated list of oligos
oligoseeker --f1 test_files/test_1.fq.gz --f2 test_files/test_2.fq.gz \
--oligos "GCGGATTACATTNNNAAATAACATCGT,TGTGGTAAGCGGNNNGAAAGCATTTGT" --output test_outs --prefix test_cm2
```

### Python API Usage

Here’s a simple example of using the Python API:

``` python
from OligoSeeker.pipeline import PipelineConfig, OligoCodonPipeline

# Create a configuration
config = PipelineConfig(
    fastq_1="../test_files/test_1.fq.gz",
    fastq_2="../test_files/test_2.fq.gz",
    oligos_list=["GCGGATTACATTNNNAAATAACATCGT", "TGTGGTAAGCGGNNNGAAAGCATTTGT", "GTCGTAGAAAATNNNTGGGTGATGAGC"],
    output_path="../test_files/test_outs",
    output_prefix='test1'
)

# Create and run the pipeline
pipeline = OligoCodonPipeline(config)
results = pipeline.run()

# Print the locations of output files
print(f"Results saved to: {results['csv_path']}")
```

    2025-03-11 17:03:59,076 - INFO - Starting OligoCodonPipeline
    2025-03-11 17:03:59,077 - INFO - Loading oligo sequences...
    2025-03-11 17:03:59,078 - INFO - Using provided oligo list
    2025-03-11 17:03:59,078 - INFO - Loaded 3 oligo sequences
    2025-03-11 17:03:59,079 - INFO - Processing FASTQ files...

    0it [00:00, ?it/s]

    2025-03-11 17:03:59,132 - INFO - Formatting results...
    2025-03-11 17:03:59,134 - INFO - Saving results to: ../test_files/test_outs/test_counts.csv
    2025-03-11 17:03:59,139 - INFO - Pipeline completed in 0.06 seconds

    Results saved to: ../test_files/test_outs/test_counts.csv

``` python
# this should show 20, 40 and 60 matches
import pandas as pd
out = pd.read_csv(results['csv_path'],index_col=[0])
out.head()
```

|      | 1_GCGGATTACATTNNNAAATAACATCGT | 2_TGTGGTAAGCGGNNNGAAAGCATTTGT | 3_GTCGTAGAAAATNNNTGGGTGATGAGC |
|:-----|------:|------:|------:|
| none | 1980.0 | 1960.0 | 1940.0 |
| ACT  | 20.0 | 0.0 | 0.0 |
| GGC  | 0.0 | 40.0 | 0.0 |
| AAA  | 0.0 | 0.0 | 60.0 |

The same pipeline can also read the oligos from a file with one
sequence per line:

``` python
from OligoSeeker.pipeline import PipelineConfig, OligoCodonPipeline

# Create a configuration
config = PipelineConfig(
    fastq_1="../test_files/test_1.fq.gz",
    fastq_2="../test_files/test_2.fq.gz",
    oligos_file="../test_files/oligos.txt",
    output_path="../test_files/test_outs",
    output_prefix='test2'
)

# Create and run the pipeline
pipeline = OligoCodonPipeline(config)
results = pipeline.run()

# Print the locations of output files
print(f"Results saved to: {results['csv_path']}")
```

    2025-03-11 17:16:28,838 - INFO - Starting OligoCodonPipeline
    2025-03-11 17:16:28,839 - INFO - Loading oligo sequences...
    2025-03-11 17:16:28,840 - INFO - Loading oligos from file: ../test_files/oligos.txt
    2025-03-11 17:16:28,841 - INFO - Loaded 3 oligo sequences
    2025-03-11 17:16:28,842 - INFO - Processing FASTQ files...

    0it [00:00, ?it/s]

    2025-03-11 17:16:28,899 - INFO - Formatting results...
    2025-03-11 17:16:28,900 - INFO - Saving results to: ../test_files/test_outs/test2_counts.csv
    2025-03-11 17:16:28,905 - INFO - Pipeline completed in 0.07 seconds

    Results saved to: ../test_files/test_outs/test2_counts.csv

## Modules

OligoSeeker is organized into several modules:

### Core

The [core module](./core.html) contains fundamental utilities and
classes:

- DNA sequence operations (reverse complement, etc.)
- OligoRegex for pattern matching
- OligoLoader for loading and validating oligo sequences

### FASTQ Processing

The [FASTQ module](./fastq.html) handles reading and processing FASTQ
files:

- FastqHandler for file operations
- OligoCodonProcessor for counting codons in FASTQ files
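
As a rough illustration of what reading a FASTQ pair involves, the sketch below iterates over two files in lockstep. It is not the `FastqHandler` implementation: `open_fastq`, `read_sequences`, and `paired_reads` are hypothetical names, and the code assumes plain four-line FASTQ records and that gzipped files end in `.gz`.

``` python
import gzip
from itertools import zip_longest


def open_fastq(path):
    """Open a FASTQ file as text, handling gzipped files by extension."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def read_sequences(handle):
    """Yield the sequence line (second of every four lines) of each record."""
    for line_number, line in enumerate(handle):
        if line_number % 4 == 1:
            yield line.strip()


def paired_reads(path_1, path_2):
    """Yield (read_1, read_2) sequence tuples from two FASTQ files."""
    with open_fastq(path_1) as handle_1, open_fastq(path_2) as handle_2:
        pairs = zip_longest(read_sequences(handle_1), read_sequences(handle_2))
        for read_1, read_2 in pairs:
            # zip_longest pads the shorter file with None, exposing a mismatch.
            if read_1 is None or read_2 is None:
                raise ValueError("FASTQ files contain different numbers of reads")
            yield read_1, read_2
```

Streaming the reads through generators keeps memory use flat regardless of file size, which matters for typical sequencing runs.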

### Output

The [output module](./output.html) manages results formatting and
saving:

- ResultsFormatter for converting results to DataFrames
- ResultsSaver for saving to various file formats
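
The layout of the counts table can be sketched with the standard library alone. The helpers `format_counts` and `to_csv` below are hypothetical and do not reproduce `ResultsFormatter` or `ResultsSaver`; the sketch only assumes what the example output and the CLI help suggest, namely one column per oligo labelled `<index + offset>_<oligo>` and one row per codon.

``` python
import csv
import io


def format_counts(oligos, counts_per_oligo, offset=1):
    """Build a header and rows shaped like the example output table."""
    header = [""] + [f"{i + offset}_{oligo}" for i, oligo in enumerate(oligos)]
    # Collect every codon seen for any oligo, keeping first-seen order.
    codons = list(dict.fromkeys(
        codon for counts in counts_per_oligo for codon in counts
    ))
    rows = [
        [codon] + [counts.get(codon, 0) for counts in counts_per_oligo]
        for codon in codons
    ]
    return header, rows


def to_csv(header, rows):
    """Serialize the table to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


oligos = ["GAACNNNCAT", "TGACNNNTAG"]
counts = [{"none": 8, "ACT": 2}, {"none": 7, "GGC": 3}]  # made-up counts
header, rows = format_counts(oligos, counts)
csv_text = to_csv(header, rows)
```

Codons that were never observed for a given oligo are filled with zero so that every column has the same set of rows.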

### Pipeline

The [pipeline module](./pipeline.html) provides the complete processing
pipeline:

- PipelineConfig for configuration settings
- ProgressReporter for progress tracking
- OligoCodonPipeline for end-to-end processing

### CLI

The [CLI module](./cli.html) implements the command-line interface:

- Argument parsing
- Configuration validation
- Pipeline execution

## CLI Reference

``` bash
usage: oligoseeker [-h] --f1 FASTQ_PATH_1 --f2 FASTQ_PATH_2
                   [--oligos-file OLIGOS_FILE] [--oligos OLIGOS_STRING]
                   [-o OUTPUT_PATH] [--prefix OUTPUT_PREFIX] [--offset OFFSET_OLIGO]
                   [--log-file LOG_FILE]
                   [--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}]

OligoSeeker: Process FASTQ files to count oligo codons

options:
  -h, --help            show this help message and exit
  --f1 FASTQ_PATH_1, --fastq_1 FASTQ_PATH_1
                        Path to FASTQ 1 file (default: ../test_fastq_files/test_1.fq.gz)
  --f2 FASTQ_PATH_2, --fastq_2 FASTQ_PATH_2
                        Path to FASTQ 2 file (default: ../test_fastq_files/test_2.fq.gz)
  -o OUTPUT_PATH, --output OUTPUT_PATH
                        Output directory for results (default: ./results)
  --prefix OUTPUT_PREFIX
                        Prefix for output files (default: )
  --offset OFFSET_OLIGO
                        Value to add to oligo index in output (default: 1)
  --log-file LOG_FILE   Path to log file (if not specified, logs to console only)
  --log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        Logging level (default: INFO)

Oligo Source Options (one required):
  --oligos-file OLIGOS_FILE
                        File containing oligo sequences (one per line)
  --oligos OLIGOS_STRING
                        Comma-separated list of oligo sequences
                        (default: GCGGATTACATTNNNAAATAACATCGT,TGTGGTAAGCGGNNNGAAAGCATTTGT,GTCGTAGAAAATNNNTGGGTGATGAGC)
```

## Data Requirements

OligoSeeker works with standard paired FASTQ files, which should be
named according to common conventions:

- Read 1: `*_1.fq.gz`, `*_R1.fastq.gz`, or `*_R1_001.fastq.gz`
- Read 2: `*_2.fq.gz`, `*_R2.fastq.gz`, or `*_R2_001.fastq.gz`
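
When scripting over many samples, the naming conventions above make it possible to derive the read 2 path from the read 1 path. The helper `mate_path` below is a hypothetical convenience, not part of OligoSeeker, and covers only the three suffix pairs listed here.

``` python
import re
from pathlib import Path

# (read 1 suffix pattern, read 2 suffix) for the three listed conventions
_PAIR_RULES = [
    (re.compile(r"_R1_001\.fastq\.gz$"), "_R2_001.fastq.gz"),
    (re.compile(r"_R1\.fastq\.gz$"), "_R2.fastq.gz"),
    (re.compile(r"_1\.fq\.gz$"), "_2.fq.gz"),
]


def mate_path(read_1_path):
    """Return the expected read 2 path for a read 1 file, or None."""
    path = Path(read_1_path)
    for pattern, replacement in _PAIR_RULES:
        if pattern.search(path.name):
            return path.with_name(pattern.sub(replacement, path.name))
    return None
```

For example, `mate_path("test_files/test_1.fq.gz")` gives `test_files/test_2.fq.gz`, and a file name that follows none of the conventions gives `None`.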

The oligo sequences should include a codon site marked with `NNN`. For
example:

    GAACNNNCAT
    TGACNNNTAG

This specifies that the 3 bases following `GAAC` or `TGAC` should be
captured as the codon.
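
Before a long run it can help to check the oligo list up front. The sketch below parses one oligo per line and rejects malformed entries. `parse_oligos` is a hypothetical helper, and its rule (exactly one `NNN` site flanked by A/C/G/T bases on both sides) is an assumption that may be stricter or looser than what `OligoLoader` enforces.

``` python
import re

# Assumed format: flanking bases, one NNN codon site, flanking bases
_VALID_OLIGO = re.compile(r"^[ACGT]+NNN[ACGT]+$")


def parse_oligos(text):
    """Parse one oligo per line, skipping blank lines, and validate each."""
    oligos = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        oligo = line.strip().upper()
        if not oligo:
            continue
        if not _VALID_OLIGO.match(oligo):
            raise ValueError(
                f"line {line_number}: {oligo!r} must have A/C/G/T bases "
                "around a single NNN site"
            )
        oligos.append(oligo)
    return oligos


oligos = parse_oligos("GAACNNNCAT\n\ntgacnnntag\n")
```

Lower-case input is normalized to upper case, and an entry such as `GAACCAT` (no `NNN` site) raises a `ValueError` that names the offending line.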

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

### Development Setup

1.  Clone the repository

2.  Install development dependencies:

    ``` bash
    pip install -e ".[dev]"
    pip install nbdev
    ```

3.  Make changes to the notebook files in the `nbs` directory

4.  Build the library:

    ``` bash
    nbdev_export
    ```

5.  Build the documentation:

    ``` bash
    nbdev_docs
    ```

## License

This project is licensed under the Apache 2.0 License - see the LICENSE
file for details.


