Metadata-Version: 2.1
Name: circmimi
Version: 0.17.1
Summary: A package for constructing CLIP-seq data-supported "circRNA - miRNA - mRNA" interactions
Home-page: https://github.com/TreesLab/CircMiMi
License: UNKNOWN
Platform: UNKNOWN
Description-Content-Type: text/markdown
Requires-Dist: click (>=7.0)
Requires-Dist: sqlalchemy (>=1.3.8)
Requires-Dist: numpy (>=1.17.2)
Requires-Dist: pandas (>=0.25.1)
Requires-Dist: openpyxl
Requires-Dist: networkx (>=2.4)
Requires-Dist: lxml (>=4.5.0)

# CircMiMi

A package for constructing CLIP-seq data-supported circRNA-miRNA-mRNA interactions

# Table of Contents
- [Requirements](#requirements)
- [Installation](#installation)
- [Quick Start](#quick-start)
- [Usage](#usage)
  - [Generate the references](#generate-the-references)
    - [Parameters](#parameters)
    - [Available species and sources](#available-species-and-sources)
  - [(Optional) Check the circRNAs](#optional-check-the-circrnas)
    - [Parameters](#parameters-1)
    - [Input file](#input-file)
    - [Output file](#output-file)
      - [checking.results.tsv](#checkingresultstsv)
  - [Predict the interactions between circRNA-miRNA-mRNA](#predict-the-interactions-between-circrna-mirna-mrna)
    - [Parameters](#parameters-2)
    - [Input file](#input-file-1)
    - [Output files](#output-files)
      - [summary_list.tsv](#summary_listtsv)
      - [all_interactions.miRNA.tsv](#all_interactionsmirnatsv)
  - [(Optional) Visualize the interactions](#optional-visualize-the-interactions)
    - [Parameters](#parameters-3)
- [Example](#example)


# Requirements

- Python (3.6 or above)
- External tools
  - bedtools (2.29.0) (https://github.com/arq5x/bedtools2)
  - miranda (aug2010, 3.3a) (http://www.microrna.org/microrna/getDownloads.do)
  - blat (https://genome.ucsc.edu/FAQ/FAQblat.html)
  - blast (https://blast.ncbi.nlm.nih.gov/Blast.cgi)


# Installation

The recommended way to install `circmimi` is via `conda` (https://docs.conda.io/en/latest/), a package and environment management system.


You may install `circmimi` by the following steps:
```bash
$ conda create -n circmimi python=3
$ conda activate circmimi
$ pip install circmimi
```

The external tools can also be installed via `conda` from the `bioconda` channel (https://bioconda.github.io/):
```bash
$ conda install -c bioconda bedtools=2.29.0 miranda blat blast
```



Now you can run the following command to test the installation:
```bash
$ circmimi_tools --help
```
It should print the help messages.



# Quick Start

1. Generate the references

```bash
$ circmimi_tools genref --species hsa --source ensembl --version 100 refs/
```


2. Check the circRNAs and do some pre-filtering (optional)
```bash
$ circmimi_tools checking -r refs/ -i circRNAs.tsv -o out/ -p 5 --dist 10000
$ cat out/checking.results.tsv | awk -F'\t' '($9==1)&&($12==0)&&($16==1)' | cut -f '-5' > out/circRNAs.filtered.tsv
```


3. Predict the interactions between circRNA-miRNA-mRNA

```bash
$ circmimi_tools interactions -r refs/ -i out/circRNAs.filtered.tsv -o out/ -p 5 --miranda-sc 175
```


4. Visualize the interactions by creating a Cytoscape-acceptable XGMML file (optional)

```bash
$ circmimi_tools visualize out/all_interactions.miRNA.tsv out/all_interactions.miRNA.xgmml
```


# Usage
## Generate the references

```
circmimi_tools genref --species SPECIES --source SOURCE [--version RELEASE_VER] REF_DIR
```

### Parameters
Parameter             | Description
:-------------------- | :------------------------------
--species SPECIES     | Assign the species for references. Use the species code for SPECIES. ***[required]***
--source SOURCE       | Available values for SOURCE: "ensembl", "ensembl_plants", "ensembl_metazoa", "gencode". ***[required]***
--version RELEASE_VER | The release version of the SOURCE. For example, "98" for ("hsa", "ensembl"), "M24" for ("mmu", "gencode"). If the version is not specified, the latest one will be used.
REF_DIR               | The directory for all generated references.




### Available species and sources

Code | Name                    |  E  |  G  |  EP |  EM |  MB | MTB | MDB | ECR |
:--  | :---------------------- | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
ath  | Arabidopsis thaliana    |     |     |  V  |     |  V  |  V  |     |     |
bmo  | Bombyx mori             |     |     |     |  V  |  V  |  V  |     |     |
bta  | Bos taurus              |  V  |     |     |     |  V  |  V  |     |     |
cel  | Caenorhabditis elegans  |  V  |     |     |  V  |  V  |  V  |     |     |
cfa  | Canis familiaris        |  V  |     |     |     |  V  |  V  |  V  |     |
cgr  | Cricetulus griseus      |  V  |     |     |     |  V  |  V  |     |     |
dre  | Danio rerio             |  V  |     |     |     |  V  |  V  |     |     |
dme  | Drosophila melanogaster |  V  |     |     |     |  V  |  V  |     |     |
gga  | Gallus gallus           |  V  |     |     |     |  V  |  V  |  V  |     |
hsa  | Homo sapiens            |  V  |  V  |     |     |  V  |  V  |  V  |  V  |
mmu  | Mus musculus            |  V  |  V  |     |     |  V  |  V  |  V  |     |
osa  | Oryza sativa            |     |     |  V  |     |  V  |  V  |     |     |
ola  | Oryzias latipes         |  V  |     |     |     |  V  |  V  |     |     |
oar  | Ovis aries              |  V  |     |     |     |  V  |  V  |     |     |
rno  | Rattus norvegicus       |  V  |     |     |     |  V  |  V  |  V  |     |
ssc  | Sus scrofa              |  V  |     |     |     |  V  |  V  |     |     |
tgu  | Taeniopygia guttata     |  V  |     |     |     |  V  |  V  |     |     |
xtr  | Xenopus tropicalis      |  V  |     |     |     |  V  |  V  |     |     |

###### Gene annotation
   - **E**: Ensembl (https://www.ensembl.org/index.html)
   - **G**: Gencode (https://www.gencodegenes.org/)
   - **EP**: Ensembl Plants (https://plants.ensembl.org/index.html)
   - **EM**: Ensembl Metazoa (https://metazoa.ensembl.org/index.html)

###### Database for miRNAs
   - **MB**: miRBase (v22) (http://www.mirbase.org/)

###### Databases for miRNA-mRNA interactions
   - **MTB**: miRTarBase (v7.0) ~(http://mirtarbase.mbc.nctu.edu.tw/php/index.php)~ (https://mirtarbase.cuhk.edu.cn/~miRTarBase/miRTarBase_2019/php/index.php)
   - **MDB**: miRDB (v6.0) (http://mirdb.org/)

###### Databases for miRNA-mRNA interactions and RBP-related data
   - **ECR**: ENCORI (http://starbase.sysu.edu.cn/index.php)

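As a rough sketch, the gene annotation columns of the table above can be transcribed into a lookup that checks a `--species` / `--source` pair before running `genref`. The mapping of the column letters to the SOURCE strings (E to "ensembl", G to "gencode", EP to "ensembl_plants", EM to "ensembl_metazoa") is an assumption, and `AVAILABLE_SOURCES` and `is_supported` are hypothetical names that are not part of `circmimi`.

```python
# Gene annotation sources per species code, transcribed from the table above.
# Assumed mapping: E -> "ensembl", G -> "gencode",
# EP -> "ensembl_plants", EM -> "ensembl_metazoa".
AVAILABLE_SOURCES = {
    "ath": {"ensembl_plants"},
    "bmo": {"ensembl_metazoa"},
    "bta": {"ensembl"},
    "cel": {"ensembl", "ensembl_metazoa"},
    "cfa": {"ensembl"},
    "cgr": {"ensembl"},
    "dre": {"ensembl"},
    "dme": {"ensembl"},
    "gga": {"ensembl"},
    "hsa": {"ensembl", "gencode"},
    "mmu": {"ensembl", "gencode"},
    "osa": {"ensembl_plants"},
    "ola": {"ensembl"},
    "oar": {"ensembl"},
    "rno": {"ensembl"},
    "ssc": {"ensembl"},
    "tgu": {"ensembl"},
    "xtr": {"ensembl"},
}


def is_supported(species, source):
    """Return True if the table lists `source` for the species code."""
    return source in AVAILABLE_SOURCES.get(species, set())
```

For instance, `is_supported("hsa", "gencode")` is true, while `is_supported("ath", "ensembl")` is false because Arabidopsis thaliana is only listed under Ensembl Plants.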


## (Optional) Check the circRNAs
```
circmimi_tools checking -r REF_DIR -i CIRC_FILE [-o OUT_PREFIX] [-p NUM_PROC] [--dist INTEGER]
```

### Parameters
Parameter                   | Description
:-------------------------- | :------------------------------
-r, --ref REF_DIR           | The directory of the pre-generated reference files. ***[required]***
-i, --circ CIRC_FILE        | The file of circRNAs. ***[required]***
-o, --out-prefix OUT_PREFIX | The prefix for the output filenames. (default: "./")
-p, --num_proc NUM_PROC     | The number of processes to use.
-d, --dist INTEGER          | The distance range for RCS checking. (default: 10000)


### Input file

The input file (CIRC_FILE) is a TAB-separated file with the following columns:

\#   | Column  | Description
:--: | :-----: | :----------
  1  |  chr    | Chromosome name
  2  |  pos1   | One of the positions of the circRNA junction site
  3  |  pos2   | Another position of the circRNA junction site
  4  |  strand | + / -
  5  |  circ_id | (Optional) User-specified name/id of the circRNA

#### Note.
- The chromosome name must be the same as the name in the SOURCE.
  - For example, "1" for "ensembl", and "chr1" for "gencode".

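A minimal sketch of producing such a file with Python's standard `csv` module is shown here. It assumes the file has no header line (the Quick Start filter also yields a headerless file), and both the helper name `write_circ_file` and the coordinates are made up for illustration.

```python
import csv
import io


def write_circ_file(handle, circrnas):
    """Write (chr, pos1, pos2, strand[, circ_id]) records as TAB-separated lines."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for record in circrnas:
        if len(record) not in (4, 5):
            raise ValueError(f"expected 4 or 5 fields, got {len(record)}")
        if record[3] not in ("+", "-"):
            raise ValueError(f"strand must be '+' or '-', got {record[3]!r}")
        writer.writerow(record)


# Made-up coordinates, using Ensembl-style chromosome names.
example = [
    ("1", 1000, 2000, "+", "circ_A"),
    ("2", 5000, 7500, "-", "circ_B"),
]
buffer = io.StringIO()
write_circ_file(buffer, example)
print(buffer.getvalue(), end="")
```

To write to disk instead, pass a handle opened with `open("circRNAs.tsv", "w", newline="")`.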

### Output file

#### checking.results.tsv

\#   | Column          | Description
:--: | :-------------- | :----------
  1  |  chr            | Chromosome name
  2  |  pos1           | One of the positions of the circRNA junction site
  3  |  pos2           | Another position of the circRNA junction site
  4  |  strand         | + / -
  5  |  circ_id        | The user-specified or auto-generated name/id of the circRNA.
  6  |  host_gene      | The gene symbol of the host gene
  7  |  donor_site_at_the_annotated_boundary | '1' if the donor site of the circRNA is at the annotated exon boundary. Otherwise '0'.
  8  |  acceptor_site_at_the_annotated_boundary | '1' if the acceptor site of the circRNA is at the annotated exon boundary. Otherwise '0'.
  9  |  donor_acceptor_sites_at_the_same_transcript_isoform | '1' if the donor and acceptor are at the same annotated transcript isoform. Otherwise '0'.
 10  |  with an alternative co-linear explanation | '1' if the merged flanking sequence of the circRNA junction sites has a co-linear explanation. Otherwise '0'.
 11  |  with multiple_hits | '1' if the merged flanking sequence of the circRNA junction sites has multiple hits. Otherwise '0'.
 12  |  alignment ambiguity (with an alternative co-linear explanation or multiple hits) | '1' if the merged flanking sequence of the circRNA junction sites has an alternative co-linear explanation or multiple hits. Otherwise '0'.
 13  |  #RCS across flanking sequences | The number of RCS pairs across the flanking sequences.
 14  |  #RCS within the flanking sequence (the donor side) | The number of RCS pairs within the flanking sequence of the donor site.
 15  |  #RCS within the flanking sequence (the acceptor side) | The number of RCS pairs within the flanking sequence of the acceptor site.
 16  |  #RCS_across-#RCS_within>=1 (yes: 1; no: 0) | 

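The `awk` filter from the Quick Start (column 9 equal to 1, column 12 equal to 0, column 16 equal to 1, keeping the first five columns) could be written in Python roughly as follows. This is only a sketch: `filter_checked` and `filter_file` are hypothetical helpers, and the flags are compared as strings rather than numerically as `awk` does.

```python
import csv


def filter_checked(rows):
    """Yield the first five columns of the rows that pass the Quick Start filter.

    Column numbers in the table are 1-based, so column 9 is index 8, and so on.
    """
    for row in rows:
        if len(row) < 16:
            continue  # not a complete result row
        same_isoform = row[8] == "1"    # column 9
        unambiguous = row[11] == "0"    # column 12
        rcs_supported = row[15] == "1"  # column 16
        if same_isoform and unambiguous and rcs_supported:
            yield row[:5]


def filter_file(in_path, out_path):
    """Apply filter_checked to a TAB-separated file and write the result."""
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.reader(src, delimiter="\t")
        writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
        writer.writerows(filter_checked(reader))
```

A header line, if present, is dropped as well, because its cells do not equal "1" or "0".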



## Predict the interactions between circRNA-miRNA-mRNA

```
circmimi_tools interactions -r REF_DIR -i CIRC_FILE [-o OUT_PREFIX] [-p NUM_PROC] \
[--miranda-sc SCORE] [--miranda-en ENERGY] [--miranda-scale SCALE] [--miranda-strict] [--miranda-go X] [--miranda-ge Y]
```

### Parameters
Parameter                   | Description
:-------------------------- | :------------------------------
-r, --ref REF_DIR           | The directory of the pre-generated reference files. ***[required]***
-i, --circ CIRC_FILE        | The file of circRNAs. ***[required]***
-o, --out-prefix OUT_PREFIX | The prefix for the output filenames. (default: "./")
-p, --num_proc NUM_PROC     | The number of processes.

The miRanda parameters are also available (see [the manual of miRanda](http://cbio.mskcc.org/microrna_data/manual.html)).

Parameters | Description
:-------------------------- | :------------------------------
--miranda-sc SCORE | Set the alignment score threshold to SCORE. Only alignments with scores >= SCORE will be used for further analysis. (default: 140.0)
--miranda-en ENERGY | Set the energy threshold to ENERGY. Only alignments with energies <= ENERGY will be used for further analysis. A negative value is required for filtering to occur. (default: 1.0)
--miranda-scale SCALE | Set the scaling parameter to SCALE. This scaling is applied to match / mismatch scores in the critical 7bp region near the 5' end of the microRNA. Many known examples of miRNA:Target duplexes are highly complementary in this region. This parameter can be thought of as a contrast function to more effectively detect alignments of this type. (default: 4.0)
--miranda-strict | Require strict alignment in the seed region (offset positions 2-8). This option prevents the detection of target sites which contain gaps or non-canonical base pairing in this region.
--miranda-go X | Set the gap-opening penalty to X for alignments. This value must be negative. (default: -4.0)
--miranda-ge Y | Set the gap-extend penalty to Y for alignments. This value must be negative. (default: -9.0)

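The two thresholds combine as "score at least SCORE and energy at most ENERGY". The sketch below only illustrates that rule with the defaults listed in the table; it is not miRanda's code, and `passes_thresholds` is a hypothetical name.

```python
def passes_thresholds(score, energy, min_score=140.0, max_energy=1.0):
    """Return True if a hit has score >= min_score and energy <= max_energy.

    With the default max_energy of 1.0 the energy test does not filter
    anything in practice; the table notes that a negative value is required.
    """
    return score >= min_score and energy <= max_energy
```

For example, with `--miranda-sc 175 --miranda-en -15`, a hit with score 175.0 and energy -20.0 would be kept, while one with energy -10.0 would be discarded.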


### Input file

The input file (CIRC_FILE) is a TAB-separated file with the following columns:

\#   | Column  | Description
:--: | :-----: | :----------
  1  |  chr    | Chromosome name
  2  |  pos1   | One of the positions of the circRNA junction site
  3  |  pos2   | Another position of the circRNA junction site
  4  |  strand | + / -
  5  |  circ_id | (Optional) User-specified name/id of the circRNA

#### Note.
- The chromosome name must be the same as the name in the SOURCE.
  - For example, "1" for "ensembl", and "chr1" for "gencode".

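If the circRNA list uses the other naming style, a small conversion step may help. The following sketch only adds or drops a leading "chr"; it assumes that the Ensembl-family sources all use bare names, and it does not handle special cases such as the mitochondrial chromosome, whose name can differ between sources. `to_source_style` is a hypothetical helper.

```python
def to_source_style(chrom, source):
    """Add or drop a leading "chr" so the name matches the given SOURCE."""
    has_prefix = chrom.startswith("chr")
    if source == "gencode":
        return chrom if has_prefix else "chr" + chrom
    # Assumption: "ensembl", "ensembl_plants" and "ensembl_metazoa"
    # all use bare names such as "1".
    return chrom[3:] if has_prefix else chrom
```

So `to_source_style("chr1", "ensembl")` gives `"1"`, and `to_source_style("1", "gencode")` gives `"chr1"`.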

### Output files
Two main files are output:
 - "summary_list.tsv"
 - "all_interactions.miRNA.tsv"


#### summary_list.tsv
The summary list contains the counts of interactions and some checking results of the circRNAs.

\#   | Column          | Description
:--: | :-------------- | :----------
  1  |  chr            | Chromosome name
  2  |  pos1           | One of the positions of the circRNA junction site
  3  |  pos2           | Another position of the circRNA junction site
  4  |  strand         | + / -
  5  |  circ_id        | The user-specified or auto-generated name/id of the circRNA.
  6  |  host_gene      | The gene symbol of the host gene
  7  |  #circRNA_miRNA | Count for the circRNA-miRNA interactions.
  8  |  #circRNA_mRNA  | Count for the miRNA-mediated circRNA-mRNA interactions.
  9  |  #circRNA_miRNA_mRNA | Count for the circRNA-miRNA-mRNA interactions.
 10  |  pass           | 'yes' if the circRNA passes all of the checking items (columns 11 to 15). Otherwise 'no'.
 11  |  donor site not at the annotated boundary | '1' if the donor site of the circRNA is NOT at the annotated exon boundary. Otherwise '0'.
 12  |  acceptor site not at the annotated boundary | '1' if the acceptor site of the circRNA is NOT at the annotated exon boundary. Otherwise '0'.
 13  |  donor/acceptor sites not at the same transcript isoform | '1' if the donor and acceptor are not at the same annotated transcript isoform. Otherwise '0'.
 14  |  ambiguity with an co-linear explanation | '1' if the merged flanking sequence of the circRNA junction sites has a co-linear explanation. Otherwise '0'.
 15  |  ambiguity with multiple hits | '1' if the merged flanking sequence of the circRNA junction sites has multiple hits. Otherwise '0'.

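Since columns 11 to 15 are each '1' when a check fails, the `pass` column presumably reads 'yes' only when all five flags are '0'. The sketch below recomputes it under that assumption; `passes_all_checks` is a hypothetical helper and not the tool's own logic.

```python
def passes_all_checks(row):
    """Return 'yes' if every flag in columns 11 to 15 is '0', otherwise 'no'."""
    flags = row[10:15]  # 1-based columns 11 to 15
    if len(flags) != 5:
        raise ValueError("row has fewer than 15 columns")
    return "yes" if all(flag == "0" for flag in flags) else "no"
```

Comparing the result with column 10 of each row is a quick way to test whether this reading of the table matches the actual output.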

#### all_interactions.miRNA.tsv

\#   | Column          | Description
:--: | :-------------- | :----------
  1  |  chr            | Chromosome name
  2  |  pos1           | One of the positions of the circRNA junction site
  3  |  pos2           | Another position of the circRNA junction site
  4  |  strand         | + / -
  5  |  circ_id        | The user-specified or auto-generated name/id of the circRNA.
  6  |  host_gene      | Host gene of the circRNA
  7  |  mirna          | The miRNA which may bind on the circRNA
  8  |  max_score      | The maximum binding score reported by miRanda
  9  |  num_binding_sites | The number of binding sites of the miRNA on the circRNA
 10  |  cross_boundary | '1' if there is a binding site across the junction of the circRNA. Otherwise '0'.
 11  |  MaxAgoExpNum   | The maximum number of supporting CLIP-seq experiments
 12  |  num_AGO_supported_binding_sites | The number of AGO-supported miRNA-binding sites
 13  |  target_gene    | The miRNA-targeted gene
 14  |  miRTarBase     | '1' if the miRNA-mRNA interaction is reported from miRTarBase. Otherwise '0'.
 15  |  miRDB          | '1' if the miRNA-mRNA interaction is reported from miRDB. Otherwise '0'.
 16  |  ENCORI         | '1' if the miRNA-mRNA interaction is reported from ENCORI. Otherwise '0'.
 17  |  miRTarBase__ref_count | The number of references reporting the interaction
 18  |  miRDB__targeting_score | The predicted target score from miRDB
 19  |  ENCORI__geneID | The gene ID of the target gene
 20  |  ENCORI__geneType | The gene type of the target gene
 21  |  ENCORI__clipExpNum | The number of supporting CLIP-seq experiments
 22  |  ENCORI__RBP    | RBP name
 23  |  ENCORI__PITA   | The number of target sites predicted by PITA
 24  |  ENCORI__RNA22  | The number of target sites predicted by RNA22
 25  |  ENCORI__miRmap | The number of target sites predicted by miRmap
 26  |  ENCORI__microT | The number of target sites predicted by microT
 27  |  ENCORI__miRanda | The number of target sites predicted by miRanda
 28  |  ENCORI__PicTar | The number of target sites predicted by PicTar
 29  |  ENCORI__TargetScan | The number of target sites predicted by TargetScan
 30  |  ENCORI__pancancerNum | The number of cancer types


#### Note.
For now, the ENCORI data only work for 'human' and 'mouse'.

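One possible way to narrow down "all_interactions.miRNA.tsv" is to keep the rows that are reported by several databases (columns 14 to 16) and that have AGO-supported binding sites (column 12). The cut-offs and the helper names in this sketch are illustrative assumptions, not recommendations from the tool; rows are expected to be data rows already split on TABs.

```python
def database_support(row):
    """Count how many of miRTarBase, miRDB and ENCORI (columns 14-16) report the pair."""
    return sum(1 for flag in row[13:16] if flag == "1")


def ago_supported_sites(row):
    """Parse column 12; empty or non-numeric values count as 0."""
    try:
        return float(row[11])
    except ValueError:
        return 0.0


def select_interactions(rows, min_databases=2, require_ago=True):
    """Yield (circ_id, mirna, target_gene) for the rows that pass both criteria."""
    for row in rows:
        if len(row) < 16:
            continue
        if database_support(row) < min_databases:
            continue
        if require_ago and not ago_supported_sites(row) > 0:
            continue
        yield (row[4], row[6], row[12])  # columns 5, 7 and 13
```

The rows can come from `csv.reader(handle, delimiter="\t")`.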

## (Optional) Visualize the interactions

```
circmimi_tools visualize [options] IN_FILE OUT_FILE
```

### Parameters
Parameter     | Description
:------------ | :------------------------------
IN_FILE       | Input the file "all_interactions.miRNA.tsv", which is the output file from 'interactions'.
OUT_FILE      | The output filename. The file extension should be ".xgmml" or ".xml", so that Cytoscape can recognize the file as an XGMML network file.
-1 INT        | Column key for circRNAs.
-2 INT        | Column key for mediators.
-3 INT        | Column key for mRNAs.


This command can generate a Cytoscape-executable file (.xgmml) for visualization of the input circRNA-miRNA-mRNA regulatory axes in Cytoscape.

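Conceptually, each row of the input contributes a circRNA-mediator edge and a mediator-mRNA edge to the network. The sketch below collects those edges from rows already split on TABs. It is not how `circmimi_tools visualize` builds the XGMML file, and the default column numbers (5, 7 and 13, taken from the "all_interactions.miRNA.tsv" table and counted from 1) are assumptions about what the `-1`, `-2` and `-3` keys refer to.

```python
def build_edges(rows, circ_col=5, mediator_col=7, mrna_col=13):
    """Return the sorted unique circRNA-mediator and mediator-mRNA edges.

    Column numbers are 1-based (an assumption about the column keys).
    """
    edges = set()
    for row in rows:
        circ = row[circ_col - 1]
        mediator = row[mediator_col - 1]
        mrna = row[mrna_col - 1]
        edges.add((circ, mediator))
        if mrna:  # skip the second edge when no target gene is given
            edges.add((mediator, mrna))
    return sorted(edges)
```

Deduplicating with a set keeps a mediator shared by several circRNAs as a single node with several incoming edges.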


# Example

Please see the "[examples](examples)" directory.

