Metadata-Version: 2.1
Name: leafcutterITI
Version: 0.2.0
Summary: LeafcutterITI implementation
Author: Xingpei Zhang, David A Knowles
License: MIT License
        
        Copyright (c) 2024 XZ3149
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
        
Project-URL: Homepage, https://github.com/XZ3149/LeafcutterITI
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy
Requires-Dist: pyranges
Requires-Dist: pandas
Requires-Dist: scipy

# LeafCutterITI


### Citation:
**Alamancos, G. P., Pagès, A., Trincado, J. L., Bellora, N., & Eyras, E. (2015). Leveraging transcript quantification for fast computation of alternative splicing profiles. RNA , 21(9), 1521–1531. https://doi.org/10.1261/rna.051557.115**

**Li, Y. I., Knowles, D. A., Humphrey, J., Barbeira, A. N., Dickinson, S. P., Im, H. K., & Pritchard, J. K. (2018). Annotation-free quantification of RNA splicing using LeafCutter. Nature Genetics, 50(1), 151–158. https://doi.org/10.1038/s41588-017-0004-9**

**Patro, R., Duggal, G., Love, M. I., Irizarry, R. A., & Kingsford, C. (2017). Salmon provides fast and bias-aware quantification of transcript expression. Nature Methods, 14(4), 417–419. https://doi.org/10.1038/nmeth.4197**


### Requirements (versions used for development)

- python (v3.10.11)

#### Python Dependencies
- numpy 
- pandas
- pyranges
- scipy

#### Additional Requirement for isoform quantification

- salmon (v1.10.0)

#### Other dependencies for Leafcutter as listed in https://github.com/davidaknowles/leafcutter/tree/master, especially for Leafcutter_ds

## LeafcutterITI
A modified version of Leafcutter that detects and analyzes alternative splicing events by quantifying excised introns by utilizing isoform abundance and transcriptome annotation. Can also be install as a command line tool with `pip install leafcutterITI`

![LeafcutterITI_Workflow](figures/LeafcutterITI_workflow.png)



There are three parts of LeafcutterITI: 
- LeafcutterITI_map_gen
- LeafcutterITI_clustering (for bulk & pseudobulk)
- LeafcutterITI_scITI (for single-cell)

### LeafcutterITI_map_gen
```
usage: python leafcutterITI_map_gen.py [-a/--annot] [--annot_source] [-o/--outprefix] 
                     [--maxintronlen] [--minintronlen] [-v/--virtual_intron] [--single_cell]
or when install with pip
leafcutterITI-map [-a/--annot] [--annot_source] [-o/--outprefix] [--maxintronlen]
                      [--minintronlen] [-v/--virtual_intron] [--single_cell]


Mandatory parameters:

-a, --annot     The transcriptome annotation gtf file for LeafcutterITI_map_gen to run with 

Optional Parameters:

--annot_source          The annotation source for the annotation, currently support Gencode and Stringtie 
                        (default: gencode)

-o, --outprefix         The prefix for output files (default: Leafcutter_)

--maxintronlen          The maximum allowed intron length for introns (default: 5,000,000)

--minintronlen          The minimum allowed intron length for introns (default: 50)

--quality_control       Whether to remove pseudogene, and decay transcript (default: True)

-v, --virtual_intron    Whether to compute virtual intron that can be used to capture
                        AFE and ALE usage, a testing feature (default: None)

--single_cell           Whether to build matrices for isoform to intron and exon, required if dealing with\
                        single cell data from alevin-fry (default: True)
```


### LeafcutterITI_clustering

```
usage: python leafcutterITI_clustering.py [--map] [--count_files] [--connect_file] [-a/--annot]
                    [--cluster_def] [-o/--outprefix] [-n/--normalization] [--samplecutoff]
                    [--introncutoff] [-m/--minclucounts] [-r/--mincluratio]
or when install with pip
leafcutterITI-cluster [--map] [--count_files] [--connect_file] [-a/--annot]
                    [--cluster_def] [-o/--outprefix] [-n/--normalization] [--samplecutoff]
                    [--introncutoff] [-m/--minclucounts] [-r/--mincluratio]


Mandatory parameters:
--map             The isoforms to introns map generated from leafcutterITI_map_gen  

--count_files     A txt file that contain the sample names 

--connect_file    The intron-exon connectivity file generated from leafcutterITI_map_gen 

-a, --annot       The transcriptome annotation gtf file 


Optional Parameters:
--cluster_def           The definition used for cluster refinement, three def available, 1: overlap, 2: overlap+share_intron_splice_site, 
                        3: overlap+share_intron_splice_site+shared_exon_splice_site (default: 3)

-o, --outprefix         The prefix for output files (default: Leafcutter_)

-n, --normalization     whether to performance normalization, if not use TPM directly (default: True)

--preprocessed          whether the files provided are already normalized, mainly for rerunning the pipeline and don't 
                        perform normalization again (default: False) 

--samplecutoff          minimum Normalized count/TPM for an intron in a sample to count as exist (default: 0)

--introncutoff          minimum Normalized count/TPM for an intron to count as exist(default 5)

--m, --minclucounts     minimum Normalized count/TPM to support a cluster (default: 30)

-r, --mincluratio       minimum fraction of reads in a cluster that supports an intron (default 0.01)

```


### LeafcutterITI_scITI

```
usage: python leafcutterITI_scITI.py [--alevin_dir] [--salmon_ref] [--ref_dir] [--barcodes_cluster] [--pseudobulk_samples]
                          [-n/--num_cell] [-k/--num_bootstrapping] [--min_eq] [--group_method] [--ref_prefix]
                          [--thread] [--cluster_def] [-o/--outprefix] [-n/--normalization] [--samplecutoff] 
                          [--introncutoff] [-m/--minclucounts] [-r/--mincluratio] [--preprocessed]

or when install with pip
leafcutterITI_scITI [--alevin_dir] [--salmon_ref] [--ref_dir] [--barcodes_cluster] [--pseudobulk_samples]
                          [-n/--num_cell] [-k/--num_bootstrapping] [--min_eq] [--group_method] [--ref_prefix]
                          [--thread] [--cluster_def] [-o/--outprefix] [-n/--normalization] [--samplecutoff] 
                          [--introncutoff] [-m/--minclucounts] [-r/--mincluratio] [--preprocessed]

Mandatory parameters:

--alevin_dir            The directory for alevin results, the file should contain the eq matrix and other files

--salmon_ref            The reference used for salmon index, The salmon reference,  maybe spliceu or splicei

--ref_dir               leafcutterITI reference directory, which should contain the matrices for isoform to intron and exon

--barcodes_cluster      The file that records which barcodes belong to which cluster/cell type in the format 'barcode,cluster' 
                        this file will be used to generate pseudobulk samples 

--pseudobulk_samples    a txt file with barcodes to pseudobulk sample are expected in format 'barcode pseudobulk_ample', if \
                        this option != None, then it will overwrite the input to --barcodes_cluster, and use the file in this option \
                        for computation. Only one of barcodes_cluster or pseudobulk_samples is required

Optional Parameters:
--ref_prefix            The prefix that is used to generate isoform to intron map using
                        leacutterITI_map_gen (default: '')

--n,--num_cell          The number of cell/barcode that you would like to include in a pseudobulk sample, cluster/cell type that has fewer
                        cell/barcodes than this number will not included in the computation (default: 100)

-k,--num_bootstrapping  the number of bootstrapping samples generated for each cluster/cell type if using bootstrapping to generate pseudobulk sample (default: 30)

--min_eq                minimum count for each eq class for it to be included in the EM (default: 5)

--pseudobulk_method     the pseudobulk sample generate method, could be metacells or bootstrapping (default: metacells)

--cluster_def           The definition used for cluster refinement, three def available, 1: overlap, 2: overlap+share_intron_splice_site, 
                        3: overlap+share_intron_splice_site+shared_exon_splice_site (default: 3)

-o, --outprefix         The prefix for output files (default: leafcutter_)

--thread                The number of threads used for parallel computation, should not be too large to avoid crash

--normalization         Whether to use normalized counts. If not, use TPM directly (default: True)

--preprocessed          Whether pseudobulk generation and EM were done, if true, then the pipeline starts from counting intron (default: False)

-v,--with_virtual       Whether the map that contain virtual intron to capture AFE and ALE(default: False)

--samplecutoff          minimum Normalized count/TPM for an isoform in a sample to count as exist (default: 0.1)

--introncutoff          minimum Normalized count/TPM for an intron to count as exist(default: 80)

--m, --minclucounts     minimum Normalized count/TPM to support a cluster (default: 100)

-r, --mincluratio       minimum fraction of reads in a cluster that supports an intron (default 0.01)

```

## Detailed Tutorial to run the LeafcutterITI

In this tutorial, we walk through all the steps to run the LeafcutterITI pipeline. For each step, we discuss the possible parameters that can be changed, how to do so and the considerations involved in each of the parameters. Finally, we show example inputs and outputs of each step (with column explanations) so the user knows what to expect and can make custom files as needed.


### Step 0: Transcriptome annotation download or generation and Salmon isoform quantification

Example human transcriptome annotation can be downloaded from https://www.gencodegenes.org/human/


### Step 1: Isoform to intron map generation

In this step, LeafcutterITI_map_gen will be used to generate a map that contains information about which isoform is generated by splicing which introns. The map will also contain information about which exon is in which isoform. This step only needs to run once for each unique transcriptome annotation gtf file. 

Sample run:
```
python LeafcutterITI_map_gen -a gencode.v45.annotation.gtf --annot_source gencode -o sample_run_ --maxintronlen 5000000 --minintronlen 50 -v False          
```


Depending on the setting, two or four files will be generated.
- {out_prefix}isoform_intron_map.tsv
- {out_prefix}intron_exon_connectivity.tsv
- {out_prefix}isoform_intron_map_with_virtual.tsv
- {out_prefix}intron_exon_connectivity_with_virtual.tsv

where with_virtual mean virtual intron was used to capture AFE and ALE (testing feature). Both transcript_intron_map.tsv and intron_exon_connectivity.tsv

if annotation_source='gencode', an additional file will be generated to give out information about the possible isoform type that can be generated by splicing out each intron
- {out_prefix}intron_source_map.tsv

This is a testing feature that haven't been tested yet
  
A record file that contains the parameters will also be generated

When --single_cell == True, five additional files will be generated. Two for sparse matrices in npz format, rows are isoforms, and columns are introns or exons. Three txt files record the row and column names.




### Step 2: Salmon isoform quantification

LeafcutterITI utlized pseudoalignment method Salmon for bulk and preprocessed pseudobulk data. For usage of Salmon please refer https://salmon.readthedocs.io/en/latest/salmon.html

For single-cell data, leafcutterITI utilized alevin-fry pipeline form Salmon. The usage of alevin-fry please refer https://alevin-fry.readthedocs.io/en/latest/. 
Specific notices, please use -d, --dump-eqclasses flag when using alevin-fry quant to obtain the eqclass matrix. Also, t2t mapping should be used instead of normal t2g mapping. t2t mapping can be easily obtained by replacing the gene col in t2g file with the transcripts.

In the rest of the tutorial, we assume RNA-seq data aligned to the transcriptome using Salmon or Alevin-fry. 

### Step 2.1: Single-cell clustering after alevin-fry

For single-cell data, after pseudoaligment, we will need to process the data and obtain a barcodes to clusters/cell_types csv file that have row in format 'barcode,cluster/cell_type'. 
There are different single-cell analysis tool can achieve this goal. For examples, Seurat and Scanpy. Any analysis tool could work as long as the barcodes to clusters/cell_types csv file is provided. 
For our analysis, we used Scanpy and tutorial for cell clustering with Scanpy could be found at https://scanpy.readthedocs.io/en/stable/tutorials/basics/clustering.html. 
After the clustering and cell labeling, the barcodes to clusters/cell_types could be export like 
`adata.obs[['cell_barcodes', 'cluster_name']].to_csv('barcode_to_cluster.csv', index = False, header = None)` 



### Step 3.1: LeafcutterITI clustering for bulk or pseudobulk data

For this step, we assume the data we are processing are bulk or pseudobulk data, we will need the file generated from step 1, the file path and the name for the isoform quantification files generated by Salmon, and the transcriptome annotation


Sample run:
```
python LeafcutterITI_clustering --map transcript_intron_map.tsv --count_files quantification_files.txt --connect_file intron_exon_connectivity.tsv -a gencode.v45.annotation.gtf --cluster_def 3 \
                                --normalization True -o sample_run_ --minclucounts 30 --mincluratio 0.01
```


Two main output files will be obtained:
- {out_prefix}refined_cluster
- {output_prefix}ratio_count

sample {out_prefix}refined_cluster
```
sample1.sf sample2.sf sample3.sf sample4.sf sample5.sf sample6.sf
chr1:17055:17233:clu_1 21.1 13 18 20 17 12 
chr1:17055:17606:clu_1 4 11.4 12 7 2 0 5 
chr1:17368:17606:clu_1 127 132 128 55 93 90 68 
chr1:668593:668687:clu_2 3 11.3 1 3 4 4 8 
chr1:668593:672093:clu_2 11 16 23 2.5 3 20 9
```

These two files are equivalent to Leafcutter clustering numers.counts.gz and counts.gz. It is worth noticing that the normalized count or TPM is not necessarily an integer, but the normalized count will exhibit a count-like property.



### Step 3.2: LeafcutterITI clustering for single-cell data

For this step, we assume the data we are processing are single-cell data. Results from Step 2 and Step 2.1 were obtained. The files required for this step will be the salmon/alevin directory that contains files `gene_eqclass.txt.gz`, `geqc_counts.mtx`, and other relevant files from alevin-fry.

Sample run:
```
LeafcutterITI-scITI --alevin_dir salmon/out_permit_know/quant_spliceu_t2t [--salmon_ref] [--ref_dir] [--barcodes_cluster] [--pseudobulk_samples]


--map transcript_intron_map.tsv --count_files quantification_files.txt --connect_file intron_exon_connectivity.tsv -a gencode.v45.annotation.gtf --cluster_def 3 \
                                --normalization True -o sample_run_ --minclucounts 30 --mincluratio 0.01
```








### Step 4:
The output from step 3 is equivalent to results from leafcutter clustering, and the results are compatible with downstream analysis for Leafcutter, such as Leafcutter_ds and Leafviz. 
Further information and downstream analysis please refer to https://davidaknowles.github.io/leafcutter/index.html















