Metadata-Version: 2.1
Name: fbilr
Version: 1.2.0
Summary: Find barcode in long reads
Home-page: https://github.com/Ckenen/fbilr
Author: Zonggui Chen
Author-email: ggchenzonggui@qq.com
License: UNKNOWN
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: Unix
Requires-Python: >=3.6
Description-Content-Type: text/markdown
License-File: LICENSE

# FBILR: Find Barcode In Long Reads

## Description

FBILR is designed to find the best-matched barcode in long reads and report detailed information, such as direction, location, and edit distance. Since the barcode is likely to be located at one of the ends of the read (head or tail), and the read length is longer than 1,000 bp, FBILR restricts the search range to within 200 nt (`-w` option) of both ends to reduce the amount of computation and save time. Besides, FBILR can run in parallel (`-t` option).

In FBILR, edit distance represents the difference between barcode sequence and reference sequence, including mismatch, insertion, and deletion of bases. The edit distance is calculated by `edlib`.

For each barcode, FBILR searches the best-matched hits of the forward barcode in the read head, the reverse barcode in the read head, the forward barcode in the read tail, and the reverse barcode in the read tail, respectively. In single-end mode (`-m` option), report the minimum edit distance hit around all barcodes. In paired-end mode, report the minimum edit distance hit in read head and tail around all barcodes respectively.

![Schema](docs/schematic.png)

Here, we show the schema of the barcode that exists in a 100 nt read:

 * In case 1, the barcode exists in the head of the read with 0 edit distance (fully matched). 
 * In case 2, the barcode exists in the middle of the read with 2 edit distance (2 mismatch). 
 * In case 3, the barcode exists in the tail of the read with 3 edit distance (1 mismatch and 2 deletion).

Finally, the bar1 is the best-matched barcode in this read.

## Installation

    # 
    python setup.py test
    python setup.py install

    # 
    pip install fbilr

## Usage

The usage of FBILR is shown below:

    # Single-end
    fbilr -t 8 -w 200 -o matrix.tsv -b barcodes.fa reads.fq.gz
    fbilr -t 8 -w 200 -b barcodes.fa reads.fq.gz | pigz -p 8 -c > matrix.tsv.gz

    # Paired-end
    fbilr -t 8 -w 200 -m PE -o matrix.tsv -b barcodes.fa reads.fq.gz

    # Multiple barcode list
    fbilr -t 8 -w 200 -o matrix.tsv -b barcodes1.fa,barcodes2.fa reads.fq.gz

    # Ignore read name in output
    fbilr -t 8 -w 200 -i -o matrix.tsv -b barcodes.fa reads.fq.gz

    # Include read sequence and quality in output
    fbilr -t 8 -w 200 -q -o matrix.tsv -b barcodes.fa reads.fq.gz

    # Find barcode and split
    fbilr -t 8 -w 200 -q -b barcodes.fa reads.fq.gz | your_custom_split_script.py


## Output

The FBILR will output tab-delimited results that consist of multiple columns (shown as follows). In the results, one row corresponds to one read in the input FASTQ file. Each read can find an optimal barcode, even though the edit distance is large.

    column 1: read name, if the '-i' option is set, the value is '.'
    column 2: read length
    column 3: barcode name
    column 4: barcode orientation (F or R)
    column 5: barcode location (H, M or T)
    column 6: start in read (0-base, included)
    column 7: end in read (0-base, not included)
    column 8: edit distance
    column ...

    # Example:
    1b2e274b-9da7-4a5f-b40f-e6c36249d825    215     Bar4    R       T       172     196     0
    ed320d59-77c6-41ba-895d-f4fdba5855f2    249     Bar2    F       H       29      53      0
    9aa445f6-63b9-44e5-9b9c-43feea216b7a    492     Bar3    F       H       36      60      0
    3087cbe0-7b00-40ff-837c-4cc59cf7e7ff    280     Bar4    R       T       239     263     0
    15c53c45-ff43-4374-8716-049495d113aa    345     Bar4    F       H       27      50      3
    21c0fe8d-1725-42ba-b490-eec2cd6f76b3    408     Bar2    F       H       27      51      0
    90af744f-1367-493d-84e2-ca2375413e2d    551     Bar8    F       H       47      71      0

Column 3 to column 8 represent 1 hit (6 columns). The number of columns is flexible and depends on the number of barcode lists and mode. The structure of columns is: information columns (2) + hit columns (6 * N) + fastq columns (4, optional)

The number of columns in single-end mode is 2 + 6. The number of columns in paired-end mode is 2 + 6 * 2. If the `-q` option is set, an additional 4 columns (name, sequence, "+", quality) is append to the tail.

For 2 barcode lists, the number of columns in single-end mode is 2 + 6 * 2. The number of columns in paired-end mode is 2 + 6 * 2 * 2.

Number of barcode list|Mode|Include fastq|Number of column
:---:|:---:|:---:|:---:
1|Single-end|N|2 + 6 = 8
1|Single-end|Y|2 + 6 + 4 = 12
1|Paired-end|N|2 + 6 * 2 = 14
1|Paired-end|Y|2 + 6 * 2 + 4 = 18
2|Single-end|N|2 + 6 * 2 = 14
2|Single-end|Y|2 + 6 * 2 + 4 = 18
2|Paired-end|N|2 + 6 * 4 = 26
2|Paired-end|Y|2 + 6 * 4 + 4 = 30
3|Single-end|N|2 + 6 * 3 = 20
3|Single-end|Y|2 + 6 * 3 + 4 = 24
3|Paired-end|N|2 + 6 * 6 = 38
3|Paired-end|Y|2 + 6 * 6 + 4 = 42

## Splitting

### Example

    1. Demultiplexing XXX datasets.
    2. Demultiplexing XXX datasets.

## Packaging and distribute PyPI

    python -m build
    python3 -m twine upload --repository pypi dist/*

## Change logs

### 2023-09-13 (v1.2.0)

1. Added test for FBILR.


