Metadata-Version: 2.1
Name: parasplit
Version: 1.1.2
Summary: A Hi-C tool for cutting sequences using specified enzymes
Author-email: Bertache Djenadi <samir.bertache.djenadi@gmail.com>
License: AGPLv3
Keywords: Hi-C,HiC,bioinformatics,cutsite
Description-Content-Type: text/markdown
Requires-Dist: biopython==1.83

<!--
SPDX-FileCopyrightText: 2024 Samir Bertache

SPDX-License-Identifier: CC0-1.0
-->

# CUTSITE SCRIPT README

## Overview


Parasplit is a Python script designed to process paired-end FASTQ files by fragmenting DNA sequences at specified restriction enzyme sites. It efficiently handles large datasets by leveraging multi-threading for decompression and compression using pigz.

## Features


- **Find and Utilize Restriction Enzyme Sites:** Automatically identifies ligation sites from provided enzyme names and generates regex patterns to locate these sites in sequences.

- **Fragmentation:** Splits sequences at restriction enzyme sites, creating smaller fragments based on specified seed size.

- **Multi-threading:** Efficiently processes large datasets by utilizing multiple threads for decompression and compression.

- **Custom Modes:** Supports different pairing modes for sequence fragments.
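
To illustrate how a ligation-site pattern can be derived from enzyme sites, here is a minimal sketch. It is not parasplit's actual code. It assumes the recognition sites are already available as strings in the `^`/`_` cut notation (for example `G^AATT_C` for EcoRI), it handles only 5'-overhang and blunt cutters, and the helper names `split_site` and `ligation_regex` are hypothetical.

```python
import re
from itertools import product


def split_site(elucidated):
    """Return the (left, right) halves that meet at a ligation junction.

    `elucidated` marks the top-strand cut with "^" and the bottom-strand
    cut with "_", e.g. "G^AATT_C". After fill-in, the left fragment ends
    with everything up to the bottom-strand cut and the right fragment
    starts at the top-strand cut.
    """
    fw = elucidated.find("^")
    rv = elucidated.find("_")
    if fw == -1 or rv == -1 or fw > rv:
        # Assumption of this sketch: 3'-overhang cutters are not handled.
        raise ValueError(f"unsupported site notation: {elucidated!r}")
    site = elucidated.replace("^", "").replace("_", "")
    rv -= 1  # the "^" that preceded "_" has been removed
    return site[:rv], site[fw:]


def ligation_regex(elucidated_sites):
    """Compile one regex matching every left/right junction combination."""
    halves = [split_site(s) for s in elucidated_sites]
    junctions = sorted(
        {left + right for (left, _), (_, right) in product(halves, repeat=2)}
    )
    # Expand the IUPAC wildcard N; other ambiguity codes are left as-is here.
    return re.compile("|".join(j.replace("N", "[ACGT]") for j in junctions))


if __name__ == "__main__":
    print(ligation_regex(["G^AATT_C", "^GATC_"]).pattern)
```

With several enzymes, every left half is combined with every right half, because a fragment cut by one enzyme can be ligated to a fragment cut by another.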


## Installation


Ensure you have Python 3 installed along with the required dependencies:

```bash
sudo apt-get install pigz
pip install parasplit
```


## Usage


The script can be executed from the command line with various arguments to customize its behavior.


### Command-Line Arguments


- `--source_forward` (str): Input file path for forward reads. Default is `../data/R1.fq.gz`.

- `--source_reverse` (str): Input file path for reverse reads. Default is `../data/R2.fq.gz`.

- `--output_forward` (str): Output file path for processed forward reads. Default is `../data/output_forward.fq.gz`.

- `--output_reverse` (str): Output file path for processed reverse reads. Default is `../data/output_reverse.fq.gz`.

- `--list_enzyme` (str): Comma-separated list of restriction enzymes. Default is "No restriction enzyme found."

- `--mode` (str): Mode of pairing fragments. Options are `all` or `fr`. Default is `fr`.

- `--seed_size` (int): Minimum length of fragments to keep. Default is 20.

- `--num_threads` (int): Number of threads to use for processing. Default is 8.

- `--borderless` (flag): Do not conserve ligation sites in the output fragments.
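
The option list does not spell out what the two `--mode` values produce. One plausible reading, sketched below, is that `fr` pairs every forward-read fragment with every reverse-read fragment, while `all` pairs every fragment with every other fragment from the same read pair. The function name `make_pairs` and this interpretation are assumptions, not a description of parasplit's internals.

```python
from itertools import combinations, product


def make_pairs(forward_frags, reverse_frags, mode="fr"):
    """Build fragment pairs under the assumed meaning of each mode."""
    if mode == "fr":
        # Each forward fragment against each reverse fragment.
        return list(product(forward_frags, reverse_frags))
    if mode == "all":
        # Every unordered pair among all fragments of the read pair.
        return list(combinations(list(forward_frags) + list(reverse_frags), 2))
    raise ValueError(f"unknown mode: {mode!r}")


if __name__ == "__main__":
    print(make_pairs(["f1", "f2"], ["r1"], mode="fr"))
    print(make_pairs(["f1", "f2"], ["r1"], mode="all"))
```

Under this reading, `all` yields more pairs than `fr` as soon as one read contains more than one fragment.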

### Example Command


```bash
parasplit --source_forward="../data/R1.fq.gz" --source_reverse="../data/R2.fq.gz" --output_forward="../data/output_forward.fq.gz" --output_reverse="../data/output_reverse.fq.gz" --list_enzyme=EcoRI,HinfI --mode=all --seed_size=20 --num_threads=8
```


## Main Script


- **Pretreatment:** Retrieval of restriction sites from the Biopython database and allocation of resources for the different processes.

- **Read:** Decompresses and reads both FASTQ files simultaneously, then sends the reads to a multiprocessing queue.

- **Frag:** Retrieves sequences from the queue, splits them into fragments at restriction enzyme sites, builds fragment pairs, and sends the pairs to a multiprocessing queue.

- **WriteAndControl:** Streams data from the output queue to the output files, compressing in parallel.
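
The fragmentation step can be sketched as follows. This is an illustration under stated assumptions, not the code in `Frag.py`: where exactly the cut falls relative to a matched site is not documented here, so the sketch cuts in the middle of each match, drops the matched site entirely when `borderless` is set, and keeps fragments whose length is at least `seed_size`. The name `split_read` is hypothetical.

```python
import re


def split_read(seq, qual, pattern, seed_size=20, borderless=False):
    """Split a read and its quality string at every match of `pattern`.

    Returns a list of (sequence, quality) tuples, keeping only fragments
    that are at least `seed_size` bases long.
    """
    bounds = []
    start = 0
    for match in pattern.finditer(seq):
        if borderless:
            # Assumed behaviour: the matched site is removed from both sides.
            bounds.append((start, match.start()))
            start = match.end()
        else:
            # Assumed behaviour: each side keeps its half of the site.
            mid = (match.start() + match.end()) // 2
            bounds.append((start, mid))
            start = mid
    bounds.append((start, len(seq)))
    return [(seq[a:b], qual[a:b]) for a, b in bounds if b - a >= seed_size]


if __name__ == "__main__":
    site = re.compile("GATCGATC")
    read = "A" * 25 + "GATCGATC" + "C" * 25 + "GATCGATC" + "TTT"
    for fragment, quality in split_read(read, "I" * len(read), site):
        print(len(fragment), fragment)
```

Slicing the sequence and the quality string with the same bounds keeps each fragment a valid FASTQ record.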


## Project architecture

![Architecture diagram](images/EnglishVersion.svg)

*Architecture diagram - License: [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/)*

## Implementation Details


- The script uses pigz for parallel decompression and compression to handle large datasets efficiently.
- Signal handlers are implemented to ensure graceful termination of processes.
- The main processing function reads input files, processes sequences to identify and fragment them at restriction sites, and writes the results to output files.
- Multi-threading is utilized for various stages of processing, including decompression, fragmentation, and compression.
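
As a rough sketch of the pigz-based reading described above, the snippet below streams a compressed FASTQ file through `pigz -dc` and groups the lines into four-line records. The function names are hypothetical, and the fallback to Python's `gzip` module when pigz is not on the `PATH` is an addition made for this example only.

```python
import gzip
import shutil
import subprocess
from itertools import islice


def iter_lines(path, threads=8):
    """Yield text lines from a gzip file, using pigz when it is available."""
    if shutil.which("pigz"):
        # -d: decompress, -c: write to stdout, -p: number of processes
        proc = subprocess.Popen(
            ["pigz", "-dc", "-p", str(threads), path],
            stdout=subprocess.PIPE,
            text=True,
        )
        try:
            yield from proc.stdout
        finally:
            proc.stdout.close()
            proc.wait()
    else:
        # Fallback for this sketch only: single-threaded decompression.
        with gzip.open(path, "rt") as handle:
            yield from handle


def read_records(path, threads=8):
    """Yield FASTQ records as (header, sequence, separator, quality)."""
    lines = (line.rstrip("\n") for line in iter_lines(path, threads))
    while True:
        record = tuple(islice(lines, 4))
        if len(record) < 4:
            break
        yield record
```

Reading from the pipe line by line keeps memory usage flat regardless of the input size, which is the point of streaming rather than decompressing to disk first.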


## Dependencies


- Python 3
- pigz
- Biopython 1.83

## Testing

### Documentation of the `tests/` Directory

#### File `test_main.py`

- **Purpose:** This file contains unit tests to verify the correct functioning of the tool. The reference files were generated by hicstuff (version 3.2.3) cutsite for a zero seed size and the DpnII enzyme.

- **Examples of Tests:**
    - `test_process_file`: Verifies that the `cut` function correctly processes an input file and generates the expected output file.
    - Additional tests specific to the different functionalities (modes) of the program.

#### Directory `input_data/`

- **Purpose:** Contains the input data used to test various configurations of the program.
- **Examples:**
    - `R1.fq.gz`, `R2.fq.gz`: Compressed FASTQ files containing DNA sequences for testing fragmentation.

#### Directory `output_data/`

- **Purpose:** Contains the expected results of the tests.
- **Examples:**
    - `output_ref_R1.fq.gz`, `output_ref_R2.fq.gz`: Compressed FASTQ files representing the expected result after processing by the program.

### Running Tests

To run the tests, use the following command:

```bash
pytest tests/
```

This command executes all tests defined in the `tests/` directory and checks that the program produces the expected output.


## Project tree structure


			├── myproject/
			│   ├── __init__.py
			│   ├── main.py
			│   ├── Frag.py
			│   ├── Read.py
			│   ├── Pretreatment.py
			│   └── WriteAndControl.py
			├── pyproject.toml
			├── requirements-dev.txt
			├── docs/
			│   ├── requirements.txt
			├── test/
			│   ├── __init__.py
			│   ├── test_main.py	
			│   ├── input_data/
			│   │   ├── R1.fq.gz
			│   │   └── R2.fq.gz
			│   └── output_data/
			│       ├── output_ref_R1.fq.gz
			│       ├── output_ref_R2.fq.gz
			│       ├── output_ref_all_R1.fq.gz
			│       └── output_ref_all_R2.fq.gz
			└── README.md
			
## Contact


For questions or issues, please contact [samir.bertache.djenadi@gmail.com](mailto:samir.bertache.djenadi@gmail.com).


---

This README provides an overview of the Cutsite Script's functionality, usage instructions, and implementation details. For more detailed information, refer to the script's source code and docstrings.

			
