Metadata-Version: 2.1
Name: pumpkin-py
Version: 0.0.2
Summary: Fast semantic search and comparison
License: BSD-3-Clause
Requires-Python: >=3.8,<=3.9
Classifier: Development Status :: 4 - Beta
Classifier: License :: OSI Approved :: BSD License
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Requires-Dist: bidict (>=0.21.2,<0.22.0)
Requires-Dist: numpy (>=1.19.0,<2.0.0)
Requires-Dist: pyroaring (>=0.3.2,<0.4.0)
Requires-Dist: rdflib (>=5.0.0,<6.0.0)
Description-Content-Type: text/markdown

PumpkinPy - Semantic similarity implemented in python

##### About

PumpkinPy uses IC ordered bitmaps for fast ranking of genes and diseases
(phenotypes are sorted by descending frequency and one-hot encoded).
This is useful for larger ontologies such as Upheno and large datasets such
as ranking all mouse genes given a set of input HPO terms.
This approach was first used in OWLTools and OwlSim-v3.

The goal of this project was to build an implementation of the PhenoDigm algorithm in python.
There are also implementations for common measures for distance and similarity
(euclidean, cosine, Jin-Conrath, Resnik, jaccard)

*Disclaimer*: This is a side project needs more documentation and testing

#### Getting Started

Requires python 3.8+ and python3-dev to install pyroaring

##### Installing from pypi

```
pip install pumpkin_py
```

##### Building locally
To build locally first install poetry - 

https://python-poetry.org/docs/#installation

Then run make:

```make```

##### Usage

Get a list of implemented similarity measures
```python
from pumpkin_py import get_methods
get_methods()
['jaccard', 'cosine', 'phenodigm', 'symmetric_phenodigm', 'resnik', 'symmetric_resnik', 'ic_cosine', 'sim_gic']
```

Load closures and annotations

```python
import gzip
from pathlib import Path

from pumpkin_py import build_ic_graph_from_closures, flat_to_annotations, search

closures = Path('.') / 'data' / 'hpo' / 'hp-closures.tsv.gz'
annotations = Path('.') / 'data' / 'hpo' / 'phenotype-annotations.tsv.gz'

root = "HP:0000118"

with gzip.open(annotations, 'rt') as annot_file:
    annot_map = flat_to_annotations(annot_file)

with gzip.open(closures, 'rt') as closure_file:
    graph = build_ic_graph_from_closures(closure_file, root, annot_map)
```

Search for the best matching disease given a phenotype profile

```python
import pprint
from pumpkin_py import search

profile_a = (
    "HP:0000403,HP:0000518,HP:0000565,HP:0000767,"
    "HP:0000872,HP:0001257,HP:0001263,HP:0001290,"
    "HP:0001629,HP:0002019,HP:0002072".split(',')
)

search_results = search(profile_a, annot_map, graph, 'phenodigm')

pprint.pprint(search_results.results[0:5])
```
```
[SimMatch(id='ORPHA:94125', rank=1, score=72.67599348696685),
 SimMatch(id='ORPHA:79137', rank=2, score=71.57368233248252),
 SimMatch(id='OMIM:619352', rank=3, score=70.98305459477629),
 SimMatch(id='OMIM:618624', rank=4, score=70.94596234638497),
 SimMatch(id='OMIM:617106', rank=5, score=70.83097366257857)]
```


##### Example scripts for fetching Monarch annotations and closures

Uses robot and sparql to generate closures and class labels

Annotation data is fetched from the latest Monarch release
 - Requires >Java 8
 
```cd data/monarch/ && make```


PhenoDigm Reference: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3649640/  
Exomiser: https://github.com/exomiser/Exomiser  
OWLTools: https://github.com/owlcollab/owltools  
OWLSim-v3: https://github.com/monarch-initiative/owlsim-v3  

