Metadata-Version: 2.1
Name: bio-shark
Version: 1.2.1
Summary: SHARK (Similarity/Homology Assessment by Relating K-mers)
Home-page: https://git.mpi-cbg.de/tothpetroczylab/shark
Author: Willis Chow <chow@mpi-cbg.de>, Soumyadeep Ghosh <soumyadeep11194@gmail.com>, Anna Hadarovich <hadarovi@mpi-cbg.de>, Agnes Toth-Petroczy <tothpet@mpi-cbg.de>, Maxim Scheremetjew <schereme@mpi-cbg.de>
Author-email: chow@mpi-cbg.de
Project-URL: Homepage, https://git.mpi-cbg.de/tothpetroczylab/shark
Project-URL: Documentation, https://git.mpi-cbg.de/tothpetroczylab/shark/-/blob/master/README.md
Project-URL: Funding, https://www.mpi-cbg.de/
Project-URL: Repository, https://git.mpi-cbg.de/tothpetroczylab/shark
Project-URL: Issue tracker, https://git.mpi-cbg.de/tothpetroczylab/shark/-/issues
Keywords: intrinsically disordered protein regions,motif detection,IDRs,sequence-to-function,alignment-free,machine learning,homology detection
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: BSD License
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Operating System :: Unix
Classifier: Operating System :: MacOS
Requires-Python: >=3.9,<3.12
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: requests~=2.31.0
Requires-Dist: catboost~=1.2
Requires-Dist: matplotlib~=3.8.2
Requires-Dist: pandas~=2.1.3
Requires-Dist: logomaker~=0.8
Requires-Dist: alfpy~=1.0.6

<h1 align="center">
<img src="https://git.mpi-cbg.de/tothpetroczylab/shark/-/raw/master/branding/logo/SharkDive_logo.png" width="300">
</h1><br>

# SHARK (Similarity/Homology Assessment by Relating K-mers)

To accurately assess homology between unalignable sequences, we developed an alignment-free sequence comparison algorithm, SHARK (Similarity/Homology Assessment by Relating K-mers). 

## SHARK-dive

We trained SHARK-dive, a machine learning homology classifier, which achieved superior performance to standard alignment in assessing homology in unalignable sequences, and correctly identified dissimilar IDRs capable of functional rescue in IDR-replacement experiments reported in the literature.

### 1. Dive-Score
Scoring the similarity between a pair of sequences

Variants:
   1. Normal (`SHARK-score (T)`)
   2. Sparse (`SHARK-score (best)`)

### 2. Dive-Predict
Find sequences similar to a given query from a target set   


## User Section

### Installation

SHARK officially supports Python versions >=3.9,<3.12.
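
As a quick sanity check before installing, the supported range can be tested from within Python. This is a minimal sketch; the helper name `is_supported_python` is illustrative and not part of SHARK.

```python
import sys

# Supported range as stated above: >=3.9,<3.12
MIN_SUPPORTED = (3, 9)
MAX_EXCLUSIVE = (3, 12)


def is_supported_python(version_info=None):
    """Return True if the (major, minor) version lies in [3.9, 3.12)."""
    if version_info is None:
        version_info = sys.version_info
    major_minor = tuple(version_info[:2])
    return MIN_SUPPORTED <= major_minor < MAX_EXCLUSIVE


# Report whether the running interpreter falls inside the supported range
print("Supported interpreter:", is_supported_python())
```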

**Recommended:** install SHARK inside a local Python virtual environment

```shell
python3 -m venv /path/to/new/virtual/environment
```

#### SHARK is installable from PyPI

```shell
$ pip install bio-shark
```

#### SHARK is also installable from source

* This allows users to import the functionality as a Python package
* This also allows users to run the functionality as a command-line utility

```shell
$ git clone git@git.mpi-cbg.de:tothpetroczylab/shark.git
```
Once you have a copy of the source, you can embed it in your own Python package, or install it into your site-packages easily.

```shell
# Make sure you have the required Python version installed
$ python3 --version
Python 3.11.5

$ cd shark
$ python3 -m venv shark-env
$ source shark-env/bin/activate
(shark-env) $ python -m pip install .
```

#### SHARK is also installable from GitLab source directly

```shell
$ pip install git+https://git.mpi-cbg.de/tothpetroczylab/shark.git
```

### How to use?

### 1. Dive

#### 1.1. Scoring: Given two protein sequences and a k-mer length (1 to 20), score the similarity between them

##### Inputs

1. Protein Sequence 1
2. Protein Sequence 2
3. Scoring variant: Normal (`SHARK-score (T)`) or Sparse (`SHARK-score (best)`)
   1. Threshold (for "Normal" only)
4. k-mer length (must be <= the length of the shorter sequence)
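
The constraint on the k-mer length can be checked before calling the scorer. The sketch below is illustrative only: `validate_k` and `kmers` are hypothetical helpers, not part of the `bio_shark` API, and enumerating overlapping k-mers is the usual convention assumed here rather than something taken from SHARK's source.

```python
def validate_k(sequence1, sequence2, k):
    """Return k if 1 <= k <= length of the shorter sequence, else raise ValueError."""
    shortest = min(len(sequence1), len(sequence2))
    if not isinstance(k, int) or k < 1:
        raise ValueError(f"k must be a positive integer, got {k!r}")
    if k > shortest:
        raise ValueError(f"k={k} exceeds the shorter sequence length ({shortest})")
    return k


def kmers(sequence, k):
    """Return every overlapping substring of length k, in order of position."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


seq1 = "LASIDPTFKAN"
seq2 = "ERQKNGGKSDSDDDEPAAKKKVEYPIAAAPPMMMP"
k = validate_k(seq1, seq2, 3)
print(kmers(seq1, k))  # a sequence of length n yields n - k + 1 k-mers
```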

##### 1.1.1. As a command-line utility
* Run the command `shark-score`
* Enter the two sequences and the k-mer length when prompted
* Enter the variant (1 or 2) when prompted, followed by the threshold if you chose "Normal"

```shell
% shark-score 
Enter Sequence 1:
> SSSSPINTHGVSTTVPSSNNTIIPSSDGVSLSQTDYFDTVHNRQSPSRRESPVTVFRQPSLSHSKSLHKDSKNKVPQISTNQSHPSAVSTANTPGPSPN
Enter Sequence 2:
> VAEREFNGRSNSLHANFTSPVPRTVLDHHRHELTFCNPNNTTGFKTITPSPPTQHQSILPTAVDNVPRSKSVSSLPVSGFPPLIVKQQQQQQLNSSSSASALPSIHSPLTNEH
Enter k-mer length (integer 1 - 10): > 5
Press: 1. Normal; 2. Sparse
> 1
Enter threshold:
>0.8
Similarity Score: 0.6552442773
```

##### 1.1.2. As an imported python package

```python
from bio_shark.core import utils
from bio_shark.dive.run import run_normal, run_sparse

dive_t_score = run_normal(
    sequence1="LASIDPTFKAN",
    sequence2="ERQKNGGKSDSDDDEPAAKKKVEYPIAAAPPMMMP",
    k=3,
    threshold=0.8
)   # Compute SHARK-score (T)  

dive_best_score = run_sparse(
    sequence1="LASIDPTFKAN",
    sequence2="ERQKNGGKSDSDDDEPAAKKKVEYPIAAAPPMMMP",
    k=3,
)   # Compute SHARK-score (best)
```

#### 1.2. Similarity Prediction

##### 1.2.1. As an imported python package

```python
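from bio_shark.dive.prediction import Prediction

# Each map is a dict of FASTA ID -> sequence; the IDs below are placeholders
query_sequences = {"query_1": "LASIDPTFKAN"}
target_sequences = {"target_1": "ERQKNGGKSDSDDDEPAAKKKVEYPIAAAPPMMMP"}

predictor = Prediction(q_sequence_id_map=query_sequences, t_sequence_id_map=target_sequences)

# Keys expected in each output element
expected_out_keys = ['seq_id1', 'sequence1', 'seq_id2', 'sequence2', 'similarity_scores_k', 'pred_label', 'pred_proba']
output = predictor.predict()    # List of output objects; each element is for one pair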
```
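
The ID-to-sequence dictionaries passed to `Prediction` can be built from FASTA files. The following is a rough sketch using only the standard library; `parse_fasta` and `read_fasta` are hypothetical helpers, not SHARK functions. Skipping sequences that contain `X` mirrors the log message printed by `shark-dive`; dropping sequences of 10 residues or fewer is an assumption based on the length requirement for prediction.

```python
def parse_fasta(lines, min_length=11):
    """Parse FASTA lines into ({id: sequence}, [skipped ids]).

    A record is skipped if it contains 'X' or is shorter than min_length
    (the length filter is an assumption, not confirmed SHARK behaviour).
    """
    records, skipped = {}, []
    seq_id, chunks = None, []

    def store(current_id, current_chunks):
        if current_id is None:
            return
        sequence = "".join(current_chunks)
        if "X" in sequence.upper() or len(sequence) < min_length:
            skipped.append(current_id)
        else:
            records[current_id] = sequence

    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            store(seq_id, chunks)
            header = line[1:].split()
            seq_id = header[0] if header else ""  # ID is the first token of the header
            chunks = []
        else:
            chunks.append(line)
    store(seq_id, chunks)  # store the final record
    return records, skipped


def read_fasta(path, min_length=11):
    """Convenience wrapper that reads the FASTA file at `path`."""
    with open(path) as handle:
        return parse_fasta(handle, min_length=min_length)
```

The first element of the returned tuple could then be passed as `q_sequence_id_map` or `t_sequence_id_map`.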

##### 1.2.2. As a command-line utility
- Run the command `shark-dive` with the absolute paths of the query and target FASTA files as the two positional arguments
- Sequences should be of length > 10, since `prediction` is always based on scores of k = [1..10]
- _You may use the `sample_fasta_file.fasta` from `data` folder (Owncloud link)_


```shell
usage: shark-dive [-h] [--output_dir OUTPUT_DIR] query target

DIVE-Predict: Given some query sequences, compute their similarity from the list of target sequences;Target is
supposed to be major database of protein sequences

positional arguments:
  query       Absolute path to fasta file for the query set of input sequences
  target      Absolute path to fasta file for the target set of input sequences

options:
  -h, --help  show this help message and exit
  --output_dir OUTPUT_DIR
                        Output folder (default: current working directory)
  
$ shark-dive "<query-fasta-file>.fasta" "<target-fasta-file>.fasta"
Read fasta file from path <query-fasta-file>.fasta; Found 4 sequences; Skipped 0 sequences for having X
Read fasta file from path <target-fasta-file>.fasta; Found 6 sequences; Skipped 0 sequences for having X
Output stored at <OUTPUT_DIR>/<path-to-sequence-fasta-file>.fasta.csv
```

- Output CSV has the following column headers: 
    - (1) "Query": Fasta ID of sequence from Query list
    - (2) "Target": Fasta ID of sequence from Target list
    - (3..12) "SHARK-Score (k=*)": Similarity score between the two sequences for specific k-value
    - (13) "SHARK-Dive": Aggregated similarity score over all lengths of k-mer
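
For downstream filtering, the output CSV can be read with the standard library. This is a minimal sketch that assumes a comma-delimited file whose header uses exactly the column names listed above; `top_hits` is a hypothetical helper, and the `0.5` cutoff is an arbitrary illustration, not a recommended threshold.

```python
import csv


def top_hits(lines, min_score=0.5):
    """Return (query, target, score) tuples with SHARK-Dive >= min_score, best first.

    `lines` is any iterable of CSV text lines, e.g. an open file handle.
    """
    hits = []
    for row in csv.DictReader(lines):
        score = float(row["SHARK-Dive"])
        if score >= min_score:
            hits.append((row["Query"], row["Target"], score))
    hits.sort(key=lambda hit: hit[2], reverse=True)
    return hits


# Usage with a file on disk (the path is a placeholder):
# with open("output.csv", newline="") as handle:
#     for query, target, score in top_hits(handle):
#         print(query, target, score)
```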

## Publication
### SHARK enables homology assessment in unalignable and disordered sequences
`Chi Fung Willis Chow, Soumyadeep Ghosh, Anna Hadarovich, Agnes Toth-Petroczy*`

_Accepted_

bioRxiv link: https://www.biorxiv.org/content/10.1101/2023.06.26.546490v1

<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.

