Metadata-Version: 2.1
Name: iPRESTO
Version: 1.0
Summary: Detection of biosynthetic sub-clusters
Home-page: https://git.wageningenur.nl/bioinformatics/iPRESTO
Author: Joris Louwen
Author-email: jorislouwen@hotmail.com
License: UNKNOWN
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: GNU Affero General Public License v3
Classifier: Operating System :: OS Independent
Requires-Python: >=3.6
Description-Content-Type: text/markdown
Requires-Dist: Bio
Requires-Dist: matplotlib
Requires-Dist: networkx
Requires-Dist: numpy
Requires-Dist: gensim
Requires-Dist: pyLDAvis
Requires-Dist: pandas
Requires-Dist: scipy
Requires-Dist: seaborn
Requires-Dist: statsmodels
Requires-Dist: sympy

# iPRESTO

iPRESTO (integrated Prediction and Rigorous Exploration of biosynthetic
Sub-clusters Tool) is a collection of Python scripts for the detection of gene
sub-clusters in a set of Biosynthetic Gene Clusters (BGCs) in GenBank format.
BGCs are tokenised by representing each gene as a combination of its Pfam
domains, where subPfams are used to increase resolution. Tokenised BGCs are
filtered for redundancy using a similarity network with an Adjacency Index of
domains as a distance metric.
For the detection of sub-clusters two methods are used: PRESTO-STAT, which is
based on the statistical algorithm from Del Carratore et al. (2019), and the
novel method PRESTO-TOP, which uses topic modelling with Latent Dirichlet
Allocation. The sub-clusters found with iPRESTO can then be linked to Natural
Product substructures.
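
To make the tokenisation and redundancy-filtering steps more concrete, here is
a minimal Python sketch. It is not taken from the iPRESTO code base: the
function names are hypothetical, and the Adjacency Index is assumed to be the
Jaccard index over pairs of adjacent domains, so iPRESTO's own definition and
thresholds may differ.

```python
from typing import List, Set, Tuple


def tokenise_gene(domains: List[str]) -> str:
    """Join the Pfam domains of one gene into a single token.

    A gene without any domain hit is written as '-', as in the example
    clusterfile further down in this README.
    """
    return ";".join(domains) if domains else "-"


def adjacent_pairs(domains: List[str]) -> Set[Tuple[str, ...]]:
    """Return the set of unordered pairs of neighbouring domains."""
    return {tuple(sorted(pair)) for pair in zip(domains, domains[1:])}


def adjacency_index(domains_a: List[str], domains_b: List[str]) -> float:
    """Assumed definition: shared adjacent pairs / all adjacent pairs."""
    pairs_a = adjacent_pairs(domains_a)
    pairs_b = adjacent_pairs(domains_b)
    union = pairs_a | pairs_b
    if not union:
        # Fewer than two domains in both BGCs: no pairs to compare.
        return 0.0
    return len(pairs_a & pairs_b) / len(union)


if __name__ == "__main__":
    print(tokenise_gene(["ketoacyl-synt", "Ketoacyl-synt_C"]))
    print(adjacency_index(["A", "B", "C"], ["A", "B", "D"]))
```

In this sketch the score is a similarity between 0 and 1; a distance for the
network can be derived from it, for example as one minus the score.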

Developed by Joris Louwen.
Supervisors: Marnix Medema (PI), Justin van der Hooft and Satria Kautsar.
All from the Bioinformatics group at Wageningen University. 

![Workflow](final_workflow_black_900ppi.png)

## Usage

iPRESTO is run through a handful of main scripts, each explained with example
commands below. All main scripts have a -h or --help option that lists
additional command line arguments and default values. Generally, the input for
an iPRESTO analysis is a directory with BGCs in GenBank format and an hmmpressed
pHMM database.
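
Before starting a run it can help to confirm that the pHMM database has
actually been pressed. The sketch below is an illustration rather than part of
iPRESTO; it relies only on the fact that hmmpress writes four index files next
to the database (.h3f, .h3i, .h3m and .h3p). The function name and the
Pfam_A.hmm path are placeholders.

```python
from pathlib import Path
from typing import List

# Index files that hmmpress creates alongside <database>.hmm
PRESSED_SUFFIXES = (".h3f", ".h3i", ".h3m", ".h3p")


def missing_pressed_files(hmm_path: str) -> List[str]:
    """Return the hmmpress index files that are absent for hmm_path."""
    return [
        hmm_path + suffix
        for suffix in PRESSED_SUFFIXES
        if not Path(hmm_path + suffix).is_file()
    ]


if __name__ == "__main__":
    missing = missing_pressed_files("Pfam_A.hmm")  # placeholder path
    if missing:
        print("Database is not (fully) pressed, missing:", ", ".join(missing))
    else:
        print("Database is pressed.")
```

If any file is reported missing, running hmmpress on the database should
create it.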

preprocessing.py turns the input directory into a CSV file of tokenised BGCs
(called clusterfile.csv) and filters out redundant BGCs.
```
python3 preprocessing.py -i my_gbk_dir -o output_dir --hmm_path Pfam_A.hmm \
        --exclude final -c 12 -e True
```

presto_stat.py performs the PRESTO-STAT method. It can start from the same
input as preprocessing.py, but it is also possible to start from a
clusterfile.csv with the flag --start_from_clusterfile. Redundancy filtering is
on by default but can be turned off with --no_redundancy_filtering.
```
# PRESTO-STAT with GBK folder input
python3 presto_stat.py -i my_gbk_dir -o output_dir --hmm_path Pfam_A.hmm \
        --exclude final -c 12 -e True -p 0.1 --include_list biosynthetic_domains.txt

# PRESTO-STAT with clusterfile input
# -i, -o and --hmm_path must still be supplied, but only as placeholder values
python3 presto_stat.py --start_from_clusterfile my_clusterfile.csv -c 12 \
        --no_redundancy_filtering -i symbolic -o symbolic --hmm_path symbolic
```

query_statistical_modules.py allows for querying a list of statistical
sub-clusters as produced by presto_stat.py. Input should be a clusterfile.
```
python3 query_statistical_modules.py -i my_clusterfile.csv -m my_modules.txt \
        -c 12 -o my_clusterfile
```

presto_top.py performs the PRESTO-TOP method. It takes a clusterfile as input
and has many command-line options to modify its behaviour, for example in the
construction of the LDA model. With the -r option one can query an existing
LDA model.
```
# Creating an LDA model and querying it at the same time
python3 presto_top.py -i my_clusterfile.csv -o my_output_folder -c 10 -t 1000 -C 3000 \
        -I 2000 --min_genes 2 -f 0.95 -n 75 --classes my_bgc_classes.txt \
        --known_subclusters known_subcl.txt
# Querying an existing model with -r
python3 presto_top.py -i my_clusterfile.csv -o my_output_folder -c 10 -t 1000 \
        --min_genes 2 -f 0.95 -n 75 --classes my_bgc_classes.txt \
        --known_subclusters known_subcl.txt -r my_lda_model_location
```

subcluster_arrower.py creates visualisations of the sub-cluster output.
One or more BGCs in GenBank format can be provided.
```
# One BGC
python3 subcluster_arrower.py --one -f BGC0000052.gbk -c domains_colour_file.tsv \
        -d preprocessing_domhits_file.txt -o BGC0000052.html \
        -s bgcs_queried_to_presto_stat_modules_list.txt -l bgc_topics.txt \
        --include_list biosynthetic_domains.txt
# Multiple BGCs
python3 subcluster_arrower.py -f file_with_gbk_locations.txt \
        -c domains_colour_file.tsv -d preprocessing_domhits_file.txt \
        -o BGC0000052.html -s bgcs_queried_to_presto_stat_modules_list.txt \
        -l bgc_topics.txt --include_list biosynthetic_domains.txt
```

An example clusterfile:
```
BGC_name1,Lactamase_B,adh_short,ketoacyl-synt;Ketoacyl-synt_C,-\n
BGC_name2,-,Lant_dehydr_N;Lant_dehydr_C,LANC_like\n
```
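
The example clusterfile can be read with a few lines of Python. This is a
minimal sketch and not iPRESTO's own parser. It assumes that the first column
is the BGC name, that each further column is one gene, that the domains within
a gene are separated by `;`, that `-` marks a gene without domain hits, and
that the trailing `\n` in the example denotes the line ending. The function
name is hypothetical.

```python
import csv
import io
from typing import Dict, List, TextIO, Tuple


def read_clusterfile(handle: TextIO) -> Dict[str, List[Tuple[str, ...]]]:
    """Map each BGC name to its genes; every gene is a tuple of domains."""
    bgcs = {}
    for row in csv.reader(handle):
        if not row:
            continue  # skip empty lines
        name, genes = row[0], row[1:]
        bgcs[name] = [tuple(gene.split(";")) for gene in genes]
    return bgcs


# The two example lines from above, with real newlines as line endings.
example = io.StringIO(
    "BGC_name1,Lactamase_B,adh_short,ketoacyl-synt;Ketoacyl-synt_C,-\n"
    "BGC_name2,-,Lant_dehydr_N;Lant_dehydr_C,LANC_like\n"
)
bgcs = read_clusterfile(example)

if __name__ == "__main__":
    for name, genes in bgcs.items():
        print(name, genes)
```

For a file on disk, pass an open file handle instead of the `io.StringIO`
object, for example `open("my_clusterfile.csv", newline="")`.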

Other scripts provide additional functionality. subPfams can be
created with https://github.com/satriaphd/build_subpfam.

## Dependencies

iPRESTO is built with Python 3.6. It requires the HMMER suite (http://hmmer.org/)
as well as some Python packages, which can be installed with pip or
setup.py.
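
A quick way to check whether HMMER is available is to look for its executables
on the PATH. The sketch below is only an illustration: the function name is
hypothetical, and the choice of hmmpress and hmmscan is an assumption about
which HMMER programs are needed, so adjust the list to your setup.

```python
import shutil
from typing import Dict, Iterable, Optional


def find_hmmer_tools(
    tools: Iterable[str] = ("hmmpress", "hmmscan"),  # assumed tool list
) -> Dict[str, Optional[str]]:
    """Map each tool name to its path on the PATH, or None if not found."""
    return {tool: shutil.which(tool) for tool in tools}


if __name__ == "__main__":
    for tool, path in find_hmmer_tools().items():
        print(f"{tool}: {path if path else 'NOT FOUND'}")
```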




