Metadata-Version: 2.4
Name: parasplit
Version: 1.1.3
Summary: A Hi-C tool for cutting sequences using specified enzymes
Author-email: Bertache Djenadi <samir.bertache.djenadi@gmail.com>
License: AGPLv3
Keywords: Hi-C,HiC,bioinformatics,cutsite
Description-Content-Type: text/markdown
Requires-Dist: biopython==1.83

<!--
SPDX-FileCopyrightText: 2024 Samir Bertache
SPDX-FileCopyrightText: 2025 2024 Samir Bertache

SPDX-License-Identifier: AGPL-3.0-or-later
SPDX-License-Identifier: CC0-1.0
-->

![pipeline status](https://gitbio.ens-lyon.fr/LBMC/hub/parasplit/badges/master/pipeline.svg)
![coverage report](https://gitbio.ens-lyon.fr/LBMC/hub/parasplit/badges/master/coverage.svg?job=tests)



# Parasplit

## Overview


Parasplit is a Python script designed to process paired-end FASTQ files by fragmenting DNA sequences at specified restriction enzyme sites. It efficiently handles large datasets by leveraging multi-threading for decompression and compression using pigz.

## Features


- **Find and Utilize Restriction Enzyme Sites:** Automatically identifies ligation sites from provided enzyme names and generates regex patterns to locate these sites in sequences.

- **Fragmentation:** Splits sequences at restriction enzyme sites and keeps only the fragments that are at least as long as the specified seed size.

- **Multi-threading:** Efficiently processes large datasets by utilizing multiple threads for decompression and compression.

- **Custom Modes:** Supports different pairing modes for sequence fragments.
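
The site lookup and fragmentation described above can be sketched in a few lines of Python. This is a minimal illustration, not Parasplit's actual code: the function names (`ligation_site`, `to_regex`, `ligation_pattern`, `split_read`) are hypothetical, the cut sites are hard-coded in the `G^AATT_C` notation instead of being fetched from Biopython, only blunt and 5'-overhang cuts are handled, and the matched ligation site is simply dropped from the output.

```python
import re

# IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def ligation_site(elucidated: str) -> str:
    """Junction produced by fill-in and blunt ligation of a cut site.

    `elucidated` marks the top-strand cut with '^' and the bottom-strand
    cut with '_', e.g. 'G^AATT_C' for EcoRI (assumed input format).
    """
    top = elucidated.index("^")
    bottom = elucidated.index("_")
    if top > bottom:
        raise ValueError("3' overhangs are not handled in this sketch")
    left = elucidated[:bottom].replace("^", "")     # filled-in left end
    right = elucidated[top + 1:].replace("_", "")   # filled-in right end
    return left + right


def to_regex(site: str) -> str:
    """Translate an IUPAC site into a regex fragment."""
    return "".join(IUPAC[base] for base in site)


def ligation_pattern(elucidated_sites):
    """Compile one regex matching the ligation site of any given enzyme."""
    return re.compile("|".join(to_regex(ligation_site(s)) for s in elucidated_sites))


def split_read(seq: str, pattern: re.Pattern, seed_size: int = 20):
    """Cut `seq` at every match and keep fragments of at least `seed_size`."""
    fragments, start = [], 0
    for match in pattern.finditer(seq):
        fragments.append(seq[start:match.start()])
        start = match.end()
    fragments.append(seq[start:])
    return [frag for frag in fragments if len(frag) >= seed_size]
```

For example, `ligation_pattern(["G^AATT_C", "G^ANT_C"])` builds a single pattern for EcoRI and HinfI that can be reused for every read.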


## Installation


Ensure Python 3 is installed, then install pigz (shown here for Debian/Ubuntu) and Parasplit:

```bash
sudo apt-get install pigz
pip install parasplit
```
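
Since pigz is an external program rather than a Python package, it may be worth checking that it is reachable before launching a long run. A minimal sketch using only the standard library; the helper name `pigz_available` is hypothetical and not part of Parasplit:

```python
import shutil


def pigz_available() -> bool:
    """Return True when a `pigz` executable is found on the PATH."""
    return shutil.which("pigz") is not None


if __name__ == "__main__":
    if pigz_available():
        print("pigz found")
    else:
        print("pigz missing: install it before running parasplit")
```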


## Usage


The script can be executed from the command line with various arguments to customize its behavior.


### Command-Line Arguments


- `--source_forward` (str): Input file path for forward reads. Default is `../data/R1.fq.gz`.

- `--source_reverse` (str): Input file path for reverse reads. Default is `../data/R2.fq.gz`.

- `--output_forward` (str): Output file path for processed forward reads. Default is `../data/output_forward.fq.gz`.

- `--output_reverse` (str): Output file path for processed reverse reads. Default is `../data/output_reverse.fq.gz`.

- `--list_enzyme` (str): Comma-separated list of restriction enzymes. Default is "No restriction enzyme found."

- `--mode` (str): Mode of pairing fragments. Options are `all` or `fr`. Default is `fr`.

- `--seed_size` (int): Minimum length of fragments to keep. Default is 20.

- `--num_threads` (int): Number of threads to use for processing. Default is 8.

- `--borderless` (flag): Do not conserve the ligation sites.
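
As a reading aid, the options above can be mirrored with `argparse`. This is a sketch reconstructed from the list in this README, not the parser shipped with Parasplit: the function name `build_parser` is hypothetical, the `--list_enzyme` default is copied from the text above, and treating `--borderless` as a boolean switch is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(prog="parasplit")
    parser.add_argument("--source_forward", type=str, default="../data/R1.fq.gz")
    parser.add_argument("--source_reverse", type=str, default="../data/R2.fq.gz")
    parser.add_argument("--output_forward", type=str,
                        default="../data/output_forward.fq.gz")
    parser.add_argument("--output_reverse", type=str,
                        default="../data/output_reverse.fq.gz")
    # Default string taken from the README; it acts as a "nothing given" marker.
    parser.add_argument("--list_enzyme", type=str,
                        default="No restriction enzyme found")
    parser.add_argument("--mode", type=str, choices=["all", "fr"], default="fr")
    parser.add_argument("--seed_size", type=int, default=20)
    parser.add_argument("--num_threads", type=int, default=8)
    # Assumption: --borderless takes no value and defaults to off.
    parser.add_argument("--borderless", action="store_true")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    enzymes = args.list_enzyme.split(",")  # e.g. ["EcoRI", "HinfI"]
    print(enzymes, args.mode, args.seed_size, args.num_threads, args.borderless)
```
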

### Example Command


```bash
parasplit --source_forward="../data/R1.fq.gz" --source_reverse="../data/R2.fq.gz" --output_forward="../data/output_forward.fq.gz" --output_reverse="../data/output_reverse.fq.gz" --list_enzyme=EcoRI,HinfI --mode=all --seed_size=20 --num_threads=8
```


## Main Script


- **Pretreatment:** Retrieves the restriction sites from the Biopython database and allocates resources to the different processes.

- **Read:** Decompresses and reads both FASTQ files simultaneously, then sends the reads to a multiprocessing queue.

- **Frag:** Retrieves reads from the queue, splits them into fragments at the restriction enzyme sites, builds pairs, and sends them to a multiprocessing queue.

- **WriteAndControl:** Writes the data from the output queue as a stream and compresses it in parallel.
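
The queue-based hand-off between these stages can be sketched as a small producer/consumer pipeline. Several things here are assumptions made for illustration: all function names are hypothetical, threads and `queue.Queue` stand in for the multiprocessing queues mentioned above, the writer appends to a list instead of streaming to pigz, and the meaning given to the modes (`fr` pairs each forward fragment with each reverse fragment, `all` pairs every two fragments of the read pair) is a guess that should be checked against the source code.

```python
import queue
import threading
from itertools import combinations, product

SENTINEL = None  # marks the end of a stream


def make_pairs(fwd_frags, rev_frags, mode="fr"):
    """Build fragment pairs; the semantics of each mode are assumed."""
    if mode == "fr":
        return list(product(fwd_frags, rev_frags))
    if mode == "all":
        return list(combinations(fwd_frags + rev_frags, 2))
    raise ValueError(f"unknown mode: {mode}")


def reader(read_pairs, in_q):
    """Read stage: push (forward, reverse) reads, then signal the end."""
    for pair in read_pairs:
        in_q.put(pair)
    in_q.put(SENTINEL)


def fragmenter(in_q, out_q, cut, mode):
    """Frag stage: cut both reads and forward every resulting pair."""
    while (item := in_q.get()) is not SENTINEL:
        fwd, rev = item
        for pair in make_pairs(cut(fwd), cut(rev), mode):
            out_q.put(pair)
    out_q.put(SENTINEL)


def writer(out_q, sink):
    """Write stage: drain the output queue into `sink`."""
    while (item := out_q.get()) is not SENTINEL:
        sink.append(item)


def run_pipeline(read_pairs, cut, mode="fr"):
    """Run the three stages concurrently and return the collected pairs."""
    in_q, out_q, sink = queue.Queue(maxsize=100), queue.Queue(maxsize=100), []
    workers = [
        threading.Thread(target=reader, args=(read_pairs, in_q)),
        threading.Thread(target=fragmenter, args=(in_q, out_q, cut, mode)),
        threading.Thread(target=writer, args=(out_q, sink)),
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return sink
```

Bounded queues keep memory use flat when one stage is slower than the others, and the sentinel lets each stage shut down cleanly once its input is exhausted.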


## Project architecture

![Architecture diagram](images/EnglishVersion.svg)

*Architecture diagram. License: [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/)*

## Dependencies

- pigz
- biopython 1.83 (installed automatically by `pip`)


## Project tree


			├── myproject/
			│   ├── __init__.py
			│   ├── main.py
			│   ├── Frag.py
			│   ├── Read.py
			│   ├── Pretreatment.py
			│   └── WriteAndControl.py
			├── pyproject.toml
			├── requirements-dev.txt
			├── docs/
			│   └── requirements.txt
			├── test/
			│   ├── __init__.py
			│   ├── test_main.py	
			│   ├── input_data/
			│   │   ├── R1.fq.gz
			│   │   └── R2.fq.gz
			│   └── output_data/
			│       ├── output_ref_R1.fq.gz
			│       ├── output_ref_R2.fq.gz
			│       ├── output_ref_all_R1.fq.gz
			│       └── output_ref_all_R2.fq.gz
			└── README.md
			
## Contact


For questions or issues, please contact [samir.bertache.djenadi@gmail.com](mailto:samir.bertache.djenadi@gmail.com).


---

This README provides an overview of Parasplit's functionality, usage instructions, and implementation details. For more detailed information, refer to the source code and its docstrings.

			
