Metadata-Version: 2.1
Name: biotdg
Version: 0.1.0
Summary: Bioinformatics Test Data Generator
Home-page: https://github.com/biowdl/biotdg
Author: Leiden University Medical Center
Author-email: sasc@lumc.nl
License: MIT
Keywords: bioinformatics test data generator creator
Platform: UNKNOWN
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: License :: OSI Approved :: MIT License
Requires-Python: >=3.6,<3.8
Description-Content-Type: text/x-rst
Requires-Dist: biopython
Requires-Dist: cyvcf2
Requires-Dist: setuptools

.. Badges have empty alts. So nothing shows up if they do not work.

.. image:: https://img.shields.io/pypi/v/biotdg.svg
  :target: https://pypi.org/project/biotdg/
  :alt:

.. image:: https://img.shields.io/conda/v/bioconda/biotdg.svg
  :target: http://bioconda.github.io/recipes/biotdg/README.html
  :alt:

.. image:: https://img.shields.io/pypi/pyversions/biotdg.svg
  :target: https://pypi.org/project/biotdg/
  :alt:

.. image:: https://img.shields.io/pypi/l/biotdg.svg
  :target: https://github.com/biowdl/biotdg/blob/master/LICENSE
  :alt:

.. image:: https://travis-ci.com/biowdl/biotdg.svg?branch=develop
  :target: https://travis-ci.com/biowdl/biotdg
  :alt:

.. image:: https://codecov.io/gh/biowdl/biotdg/branch/develop/graph/badge.svg
  :target: https://codecov.io/gh/biowdl/biotdg
  :alt:

biotdg: Bioinformatics Test Data Generator
==========================================

``biotdg`` can generate mutations based on vcf files for genomes where the
chromosomes have different ploidy. It was made to create test genomes for
pipelines that correctly handle the ploidy of sex chromosomes. It can also be
used to create test data for pipelines that handle triploid species, such as
banana, or for pipelines that discover chromosome imbalances, such as
trisomy-21 (Down syndrome) and XXY males (Klinefelter syndrome).

``biotdg`` uses a reference genome, a ploidy table and a vcf file to create a
"true genome" for a sample. For example, if the ploidy table states that
``chr21`` has a ploidy of 3 then the "true genome" will have three copies
of ``chr21``. Each ``chr21`` copy will have its own mutations based on the
vcf file.

After creating the "true genome" fasta file. ``biotdg`` uses the
`dwgsim <https://github.com/nh13/dwgsim>`_ program to generate fastq reads.

Usage
-----

.. code-block:: text

    usage: biotdg [-h] [--version] -r REFERENCE --vcf VCF -p PLOIDY_TABLE -s
                  SAMPLE_NAME [-z RANDOM_SEED] [-l READ_LENGTH] [-C COVERAGE]
                  [-e READ1_ERROR_RATE] [-E READ2_ERROR_RATE]
                  [-n MAXIMUM_N_NUMBER] [-o OUTPUT_DIR]

    Bioinformatics Test Data Generator

    optional arguments:
      -h, --help            show this help message and exit
      --version             show program's version number and exit
      -r REFERENCE, --reference REFERENCE
                            Reference genome for the sample.
      --vcf VCF             VCF file with mutations.
      -p PLOIDY_TABLE, --ploidy-table PLOIDY_TABLE
                            Tab-delimited file with two columns specifying the
                            chromosome name and its ploidy. By default all
                            chromosomes have a ploidy of 2.
      -s SAMPLE_NAME, --sample-name SAMPLE_NAME
                            Name of the sample to generate. The sample must be in
                            the VCF file.
      -z RANDOM_SEED, --random-seed RANDOM_SEED
                            Random seed for dwgsim (default: 1).
      -l READ_LENGTH, --read-length READ_LENGTH
                            Read length to be used by dwgsim.
      -C COVERAGE, --coverage COVERAGE
                            Average coverage for the generated reads. NOTE: This
                            is multiplied by the ploidy of the chromosome.
      -e READ1_ERROR_RATE, --read1-error-rate READ1_ERROR_RATE
                            Same as -e flag in dwgsim. per base/color/flow error
                            rate of the first read.
      -E READ2_ERROR_RATE, --read2-error-rate READ2_ERROR_RATE
                            Same as -E flag in dwgsim. per base/color/flow error
                            rate of the second read.
      -n MAXIMUM_N_NUMBER, --maximum-n-number MAXIMUM_N_NUMBER
                            Maximum number of Ns allowed in a given read.
      -o OUTPUT_DIR, --output-dir OUTPUT_DIR


Example
-------
Given the following ``reference.fasta`` file

.. code-block:: text

    >chr1
    GATTACA
    GATTACA
    GATTACA
    >chrX
    AGTCAGTCAGTC
    >chrY
    AGAATC

the following ploidy table.tsv

.. code-block:: text

    chr1	3
    chrX	2
    chrY	1

and the following vcf:

.. code-block:: text

    ##fileformat=VCFv4.1
    ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
    ##contig=<ID=chr1,length=21>
    ##contig=<ID=chrX,length=12>
    ##contig=<ID=chrY,length=6>
    #CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	sample1
    chr1	4	.	T	A,C,G	.	.	.	GT	1/2/3
    chr1	7	.	A	T	.	.	.	GT	0/1/0
    chrX	1	.	A	T	.	.	.	GT	0/1
    chrX	2	.	G	T	.	.	.	GT	0/0
    chrY	4	.	A	C	.	.	.	GT	1

A "true genome" for sample1 looks like this:

.. code-block:: text

    >chr1_0
    GATAACAGATTACAGATTACA
    >chr1_1
    GATCACTGATTACAGATTACA
    >chr1_2
    GATGACAGATTACAGATTACA
    >chrX_0
    AGTCAGTCAGTC
    >chrX_1
    TGTCAGTCAGTC
    >chrY_0
    AGACTC

.. note::

    Mutations are always generated in a phased manner. A ``_0`` chromosome
    will receive all the genotypes in the VCF that are at index 0 (the outer
    left one). This is true even if the variants are not described as phased
    in the vcf.

Why ``biotdg`` and not ``dwgsim``?
----------------------------------

``dwgsim`` has excellent capabilities for generating reads that are close to
real data. Therefore ``dwgsim`` is used by ``biotdg`` in this capacity.

``dwgsim`` can also generate mutations randomly and output these in VCF format.
It also has the capability to use a VCF to generate mutations. This VCF-based
method was not deemed sufficient for the following reasons:

+ Very poorly documented.
+ Only allows ploidy of 1 or 2. There is an option '3' but that does something
  different.
+ How exactly mutations are generated is unknown. Is it aware of phasing? If
  so, how does it handle it?

``biotdg`` handles the creation of the "true genome" transparently and then
uses dwgsim to generate reads. ``biotdg`` can handle genomes with mixed
ploidies (as is the case for most species with a sex chromosome) well.

Known limitations
-----------------
+ Overlapping mutations are not handled properly. (Probably not a concern for
  generating test data.)
+ Mutations are always generated in a phased manner. This was easier to
  implement than an unphased manner. It is also more transparent. Some extra
  work will be required to handle unphased generation of mutations.
+ ``biotdg`` is only tested with SNPs. Indels and other variant types were not
  tested.

.. TODO: Test with indels.


