Metadata-Version: 2.1
Name: maginator
Version: 0.1.20
Summary: MAGinator: Abundance, strain, and functional profiling of MAGs
Home-page: https://github.com/Russel88/MAGinator
Author: Jakob Russel & Trine Zachariasen
Author-email: russel2620@gmail.com,trine_zachariasen@hotmail.com
License: UNKNOWN
Description: [![Project Status: Active - The project has reached a stable, usable state and is being actively developed.](http://www.repostatus.org/badges/latest/active.svg)](http://www.repostatus.org/#active)
        
        # MAGinator
        
        Combining the strengths of contig and gene based methods to provide:
        
        * Accurate abundances of species using de novo signature genes
            * MAGinator uses a statistical model to find the best genes for calculating accurate abundances
        * SNV-level resolution phylogenetic trees based on signature genes
            * MAGinator creates a phylogenetic tree for each species so you can associate your metadata with subspecies/strain level differences
        * Connect accessory genome to the species annotation by getting a taxonomic scope for gene clusters
            * MAGinator clusters all ORFs into gene clusters and for each gene cluster you will know which taxonomic level it is specific to
        * Improve your functional annotation by grouping your genes in synteny clusters based on genomic adjacency
            * MAGinator clusters gene clusters into synteny clusters. Syntenic genes are usually part of the same pathway or have similar functions
        
        ## Installation
        
        All you need for running MAGinator is snakemake and mamba. Other dependencies will be installed by snakemake automatically.
        
        ```sh
        conda create -n maginator -c bioconda -c conda-forge snakemake mamba
        conda activate maginator
        pip install maginator
        ```
        
        MAGinator also needs the GTDB-tk database. If you don't already have it, you can download release 214 as follows:
        ```sh
        wget https://data.ace.uq.edu.au/public/gtdb/data/releases/release214/214.1/auxillary_files/gtdbtk_r214_data.tar.gz
        tar xvzf *.tar.gz
        ```
        
        ## Usage
        
        MAGinator needs 3 input files:
        
        * The clusters.tsv file from [VAMB](https://github.com/RasmussenLab/vamb)
        * A fasta file with sequences of all contigs, with unique names
        * A comma-separated file giving the position of the fastq files with your sequencing reads formatted as: SampleName,PathToForwardReads,PathToReverseReads
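        
        If your reads follow a regular naming scheme, the reads file can be generated rather than written by hand. The sketch below assumes a hypothetical layout in which each sample has a `<sample>_R1.fastq.gz` and a `<sample>_R2.fastq.gz` file in one directory. The directory, sample names and suffixes are placeholders for illustration, not something MAGinator requires.
        
        ```sh
        # Hypothetical layout: <sample>_R1.fastq.gz / <sample>_R2.fastq.gz pairs in one directory.
        # Empty placeholder files are created here so the example runs on its own.
        demo_dir=$(mktemp -d)
        touch "$demo_dir/sampleA_R1.fastq.gz" "$demo_dir/sampleA_R2.fastq.gz" \
              "$demo_dir/sampleB_R1.fastq.gz" "$demo_dir/sampleB_R2.fastq.gz"
        
        # Write one line per sample: SampleName,PathToForwardReads,PathToReverseReads
        for fwd in "$demo_dir"/*_R1.fastq.gz; do
            rev="${fwd%_R1.fastq.gz}_R2.fastq.gz"
            sample=$(basename "$fwd" _R1.fastq.gz)
            # Only keep samples that have both read files
            if [ -e "$rev" ]; then
                printf '%s,%s,%s\n' "$sample" "$fwd" "$rev"
            fi
        done > "$demo_dir/reads.csv"
        
        cat "$demo_dir/reads.csv"
        ```
        
        Replace `demo_dir` with the directory holding your own reads and drop the `touch` line.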
        
        Run MAGinator:
        ```sh
        maginator -v vamb_clusters.tsv -r reads.csv -c contigs.fasta -o my_output -g "/path/to/GTDB-Tk/database/release214/"
        ```
        
        ### Run on a compute cluster
        MAGinator can run on compute clusters using qsub (torque), sbatch (slurm), or drmaa. The --cluster argument selects the type of compute cluster infrastructure. The --cluster_info argument sets the options passed to the submission command. It has to contain the keywords {cores}, {memory} and {runtime}, which are used to forward resource information to the cluster.
        
        For example, MAGinator can be run with qsub using the following command (... indicates required arguments, see above):
        ```sh
        maginator ... --cluster qsub --cluster_info "-l nodes=1:ppn={cores}:thinnode,mem={memory}gb,walltime={runtime}"
        ```
        
        ## Test data
        
        A test set can be found in the maginator/test_data directory. 
        1. Download the 3 samples used for the test at SRA: https://www.ncbi.nlm.nih.gov/sra?LinkName=bioproject_sra_all&from_uid=715601 with the IDs dfc99c_A, f9d84e_A and 221641_A
        2. Clone repo: git clone https://github.com/Russel88/MAGinator.git
        3. Change the paths to the read-files in reads.csv
        4. Unzip the contigs.fasta.gz 
        5. Run MAGinator
        
        MAGinator can be run on the test data on a slurm server with the following command:
        ```sh
        maginator --vamb_clusters clusters.tsv --reads reads.csv --contigs contigs.fasta --gtdb_db data/release214/ --output test_out --cluster slurm --cluster_info "-n {cores} --mem {mem_gb}gb -t {runtime}" --max_mem 180
        ```
        The expected output can be found as a zipped file on Zenodo: https://doi.org/10.5281/zenodo.8279036. It was produced by running MAGinator on the test data (using GTDB-tk database release207_v2) on a slurm server.
        
        On the compute cluster, each job had access to 180 GB of RAM, and the run took the following time:
        real	72m27.379s
        user	0m18.830s
        sys	1m0.454s
        
        If you run on a smaller server you can set the parameters --max_cores and --max_mem.
        
        ## Recommended workflow 
        
        To generate the input files for MAGinator, we have created a recommended workflow covering preprocessing, assembly and binning of your metagenomic reads (the rules for binning have been copied from VAMB (https://github.com/RasmussenLab/vamb/blob/master/workflow/)).
        It is set up as a Snakefile in recommended_workflow/reads_to_bins.Snakefile.
        
        The input to the workflow is the reads.csv file. The workflow can be run using snakemake:
        ```
        snakemake --use-conda -s reads_to_bins.Snakefile --resources mem_gb=180 --config reads=reads.csv --cores 10 --printshellcmds 
        ```
        
        Prepare the data for the MAGinator run by replacing every @ with _ in the contig and cluster names:
        ```
        sed 's/@/_/g' assembly/all_assemblies.fasta > all_assemblies.fasta
        sed 's/@/_/g' vamb/clusters.tsv > clusters.tsv
        ```
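        
        Since MAGinator expects unique contig names, a quick sanity check on the prepared fasta file can save a failed run. The following is a minimal sketch, not part of MAGinator itself. It runs on a small stand-in file with made-up contig names; point `demo_fasta` at your own all_assemblies.fasta to use it.
        
        ```sh
        # Small stand-in for all_assemblies.fasta (hypothetical contig names)
        demo_fasta=$(mktemp)
        printf '>S1_contig1\nACGT\n>S1_contig2\nGGCC\n>S2_contig1\nTTAA\n' > "$demo_fasta"
        
        # Contig names (header text up to the first whitespace) that occur more than once
        duplicates=$(awk '/^>/ {print $1}' "$demo_fasta" | sort | uniq -d)
        # Number of headers that still contain '@' (grep -c prints 0 but exits 1 on no match)
        at_count=$(grep -c '^>.*@' "$demo_fasta" || true)
        
        if [ -z "$duplicates" ] && [ "$at_count" -eq 0 ]; then
            echo "contig names look fine"
        else
            echo "problem: duplicated contig names or '@' left in headers" >&2
        fi
        ```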
        
        Now you are ready to run MAGinator.
        
        ## Functional Annotation
        
        To generate the functional annotation of the genes we recommend using EggNOG mapper (https://github.com/eggnogdb/eggnog-mapper).
        
        You can download it and try running it on the test data:
        ```sh
        mkdir test_out/functional_annotation
        emapper.py -i test/genes/all_genes_rep_seq.fasta --output test_out/functional_annotation -m diamond --cpu 38
        ```
        
        The eggNOG output can be merged with clusters.tsv and further processed to obtain functional annotations at the MAG, cluster or sample level with the following command:
        ```sh
        (echo -e '#sample\tMAG_cluster\tMAG\tfunction'; join -1 1 -2 1 <(awk '{print $2 "\t" $1}' clusters.tsv | sort) <(tail -n +6 annotations.tsv | head -n -3 | cut -f1,15 | grep -v '\-$' | sed 's/_[[:digit:]]\+\t/\t/' | sed 's/,/\n/g' | perl -lane '{$q = $F[0] if $#F > 0; unshift(@F, $q) if $#F == 0}; print "$F[0]\t$F[1]"' | sed 's/\tko:/\t/' | sort) | awk '{print $2 "\t" $2 "\t" $3}' | sed 's/_/\t/' | sort -k1,1 -k2,2n) > MAGfunctions.tsv
        ```
        In this case, column 15 (KEGG orthologs) was picked from the eggNOG-mapper output. Cutting column 13 instead would give GO terms. Refer to the header of the eggNOG-mapper output for other available functional annotations, e.g. KEGG pathways, Pfam, CAZy, COGs, etc.
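        
        As an illustration of further processing, the sketch below counts the distinct functions per MAG cluster. It assumes that MAGfunctions.tsv is tab-separated with the four columns named in the header above; the rows are made up so the example runs on its own and do not come from real MAGinator output.
        
        ```sh
        # Stand-in for MAGfunctions.tsv with invented rows (sample, MAG_cluster, MAG, function)
        demo_tsv=$(mktemp)
        printf '#sample\tMAG_cluster\tMAG\tfunction\n' > "$demo_tsv"
        printf 'S1\t1\t1\tK00001\nS1\t1\t1\tK00002\nS2\t1\t1\tK00001\nS2\t7\t7\tK00845\n' >> "$demo_tsv"
        
        # Count distinct functions (column 4) per MAG cluster (column 2), skipping the header line
        per_cluster=$(awk -F '\t' '!/^#/ && !seen[$2 FS $4]++ {n[$2]++}
                                   END {for (c in n) print c "\t" n[c]}' "$demo_tsv" | sort -k1,1n)
        printf '%s\n' "$per_cluster"
        ```
        
        Each output line holds a MAG cluster and its number of distinct functions. Grouping on column 1 or 3 instead gives per-sample or per-MAG counts.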
        
        
        ## MAGinator workflow
        
        This is what MAGinator does with your input (if you want to see all parameters run maginator --help):
        * Filter bins by size
            * Use --binsize to control the cutoff
        * Run GTDB-tk to taxonomically annotate bins and call open reading frames (ORFs)
        * Group your VAMB clusters into metagenomic species (MGS) based on the taxonomic annotation. (Unannotated VAMB clusters are kept in the pipeline, but left unchanged)
            * Use --no_mgs to disable this
            * Use --annotation_prevalence to change how prevalent an annotation has to be in a VAMB cluster to call taxonomic consensus
        * Cluster your ORFs into gene clusters to get a non-redundant gene catalogue
            * Use --clustering_min_seq_id to set the clustering identity
            * Use --clustering_coverage to set the clustering coverage
            * Use --clustering_type to choose whether to cluster on amino acid or nucleotide level
        * Map reads to the non-redundant gene catalogue and create a matrix with gene counts for each sample
        * Pick non-redundant genes that are only found in one MAG cluster each
        * Fit signature gene model and use the resulting signature genes to get the abundance of each MAG cluster
            * Use --min_mapped_signature_genes to change the minimum number of signature genes that must be detected in a sample for it to be included in the analysis
            * Use --min_samples to change the number of samples in which the MAG cluster must be present for signature gene refinement to be performed
        * Prepare for generation of phylogenies for each MAG cluster by finding outgroups and marker genes which will be used for rooting the phylogenies
        * Use the read mappings to collect SNV information for each signature gene and marker gene for each sample
        * Align signature and marker genes, concatenate alignments and infer phylogenetic trees for each MAG cluster
            * Use --phylo to choose whether to use fasttree (fast, approximate) or iqtree (slow, precise) to infer phylogenies
        * Infer the taxonomic scope of each gene cluster, that is, the taxonomic level at which genes from a given gene cluster are found
            * Use --tax_scope_threshold to set the threshold for finding the taxonomic scope consensus
        * Cluster gene clusters into synteny clusters based on how often they are found adjacent on contigs
        
        
        ## Output
        
        * abundance/
            * abundance_phyloseq.RData - Phyloseq object for R, with abundance and taxonomic data
        * clusters/
            * <cluster>/<bin>.fa - Fasta files with nucleotide sequence of bins
        * genes/
            * all_genes.faa - Amino acid sequences of all ORFs
            * all_genes.fna - Nucleotide sequences of all ORFs
            * all_genes_nonredundant.fasta - Nucleotide sequences of gene cluster representatives
            * all_genes_cluster.tsv - Gene clusters
            * matrix/
                * gene_count_matrix.tsv - Read count for each gene cluster for each sample
                * small_gene_count_matrix.tsv - Read count matrix containing only the genes that do not cluster across MAG clusters
            * synteny/ - Intermediate files for synteny clustering of gene clusters
        * gtdbtk/
            * <cluster>/ - GTDB-tk taxonomic annotation for each VAMB cluster
        * logs/ - Log files
        * mapped_reads/
            * bams/ - Bam files for mapping reads to gene clusters
        * phylo/
            * alignments/ - Alignments for each signature gene
            * cluster_alignments/ - Concatenated alignments for each MAG cluster
            * pileup/ - SNV information for each MAG cluster and each sample
            * trees/ - Phylogenetic trees for each MAG cluster
            * stats.tab - Mapping information such as non-N fraction, number of signature genes and marker genes, read depth, and number of bases not reaching allele frequency cutoff 
            * stats_genes.tab - Same as above but the information is split per gene
        * signature_genes/ 
            * \- R data files with signature gene optimization
            * read-count_detected-genes.pdf - Figure for each MAG cluster displaying the number of identified signature genes in each sample along with the number of reads mapped
        * tabs/
            * gene_cluster_bins.tab - Table listing which bins each gene cluster was found in
            * gene_cluster_tax_scope.tab - Table listing the taxonomic scope of each gene cluster
            * metagenomicspecies.tab - Table listing which clusters, if any, were merged into MAG clusters and the taxonomy of those
            * signature_genes_cluster.tsv - Table with the signature genes for each MAG cluster
            * synteny_clusters.tab - Table listing the synteny cluster association for the gene clusters. Gene clusters from the same synteny cluster are genomically adjacent.
            * tax_matrix.tsv - Table with taxonomy information for each MAG cluster
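        
        To check quickly whether a run produced the main result files, something like the following sketch can be used. The helper `check_outputs` is hypothetical and not part of MAGinator, and it only looks for a handful of the files listed above. The demo runs on a throwaway directory; pass your own output directory instead.
        
        ```sh
        # Report which of a few documented output files exist under an output directory.
        # Returns success only if none of them is missing.
        check_outputs() {
            out_dir=$1
            missing=0
            for rel in \
                abundance/abundance_phyloseq.RData \
                genes/all_genes_cluster.tsv \
                genes/matrix/gene_count_matrix.tsv \
                tabs/signature_genes_cluster.tsv \
                tabs/tax_matrix.tsv
            do
                if [ -e "$out_dir/$rel" ]; then
                    echo "found    $rel"
                else
                    echo "missing  $rel"
                    missing=$((missing + 1))
                fi
            done
            [ "$missing" -eq 0 ]
        }
        
        # Demo on a temporary directory that contains only one of the files
        demo_out=$(mktemp -d)
        mkdir -p "$demo_out/tabs"
        touch "$demo_out/tabs/tax_matrix.tsv"
        check_outputs "$demo_out" || echo "some outputs are missing"
        ```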
            
        
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Development Status :: 4 - Beta
Requires-Python: >=3.5
Description-Content-Type: text/markdown
