Metadata-Version: 2.1
Name: disease-normalizer
Version: 0.2.21.dev0
Summary: VICC normalization routine for diseases
Home-page: https://github.com/cancervariants/disease-normalization
Author: VICC
Author-email: help@cancervariants.org
License: MIT
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Developers
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pydantic
Requires-Dist: fastapi (>=0.72.0)
Requires-Dist: uvicorn
Requires-Dist: boto3
Requires-Dist: ga4gh.vrsatile.pydantic (~=0.0.11)
Requires-Dist: click
Provides-Extra: dev
Requires-Dist: pre-commit ; extra == 'dev'
Requires-Dist: flake8 ; extra == 'dev'
Requires-Dist: flake8-docstrings ; extra == 'dev'
Requires-Dist: lxml ; extra == 'dev'
Requires-Dist: xmlformatter ; extra == 'dev'
Provides-Extra: etl
Requires-Dist: owlready2 (==0.40) ; extra == 'etl'
Requires-Dist: rdflib ; extra == 'etl'
Requires-Dist: requests ; extra == 'etl'
Requires-Dist: typing-extensions ; extra == 'etl'
Requires-Dist: bioversions ; extra == 'etl'
Provides-Extra: pg
Requires-Dist: psycopg[binary] ; extra == 'pg'
Provides-Extra: test
Requires-Dist: pytest ; extra == 'test'
Requires-Dist: pytest-cov ; extra == 'test'
Requires-Dist: coveralls ; extra == 'test'
Requires-Dist: coverage ; extra == 'test'

# Disease Normalizer

Services and guidelines for normalizing disease terms

## Installation

The Disease Normalizer is available via PyPI:

```commandline
pip install disease-normalizer[etl,pg]
```

The `[etl,pg]` argument tells pip to install the dependencies of the `disease.etl` package and of the PostgreSQL data storage implementation, in addition to those of the default DynamoDB data storage implementation.

### External requirements

The Disease Normalizer can retrieve most required data itself. The exception is disease terms from OMIM, for which a source file must be manually acquired and placed in the `disease/data/omim` folder within the library root. In order to access OMIM data, users must submit a request [here](https://www.omim.org/downloads). Once approved, the relevant OMIM file (`mimTitles.txt`) should be renamed according to the convention `omim_YYYYMMDD.tsv`, where `YYYYMMDD` indicates the date that the file was generated, and placed in the appropriate location.
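
The renaming convention described above can be sketched in Python. This is only an illustration, not part of the library: the helper names `omim_target_name` and `stage_omim_file` are hypothetical, and the generation date must be supplied by the user.

```python
import shutil
from datetime import date
from pathlib import Path


def omim_target_name(generated: date) -> str:
    """Build the omim_YYYYMMDD.tsv file name for a given generation date."""
    return f"omim_{generated.strftime('%Y%m%d')}.tsv"


def stage_omim_file(downloaded: Path, data_dir: Path, generated: date) -> Path:
    """Copy a downloaded mimTitles.txt into data_dir under the expected name.

    Hypothetical helper: data_dir should point at the disease/data/omim
    folder within the library root.
    """
    data_dir.mkdir(parents=True, exist_ok=True)
    target = data_dir / omim_target_name(generated)
    shutil.copyfile(downloaded, target)
    return target
```

For example, `stage_omim_file(Path("mimTitles.txt"), Path("disease/data/omim"), date(2023, 1, 15))` would produce `disease/data/omim/omim_20230115.tsv` (the date here is made up for illustration).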

### Database Initialization

The Disease Normalizer supports two data storage options:

* [DynamoDB](https://aws.amazon.com/dynamodb), a NoSQL service provided by AWS. This is our preferred storage solution. In addition to cloud deployment, Amazon also provides a tool for running the service locally, which can be downloaded [here](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DynamoDBLocal.DownloadingAndRunning.html). Once downloaded, you can start the service by running `java -Djava.library.path=./DynamoDBLocal_lib -jar DynamoDBLocal.jar -sharedDb` in a terminal (add a `-port <VALUE>` option to use a different port).
* [PostgreSQL](https://www.postgresql.org/), a well-known relational database technology. After starting the Postgres server process, [ensure that a database is created](https://www.postgresql.org/docs/current/sql-createdatabase.html) (we typically name ours `disease_normalizer`).

By default, the Disease Normalizer expects to find a DynamoDB instance listening at `http://localhost:8000`. Alternative locations can be specified in two ways:

The first way is to set the `--db_url` command-line option to the URL endpoint.

```commandline
disease_norm_update --update_all --db_url="http://localhost:8001"
```

The second way is to set the `DISEASE_NORM_DB_URL` environment variable to the URL endpoint.

```commandline
export DISEASE_NORM_DB_URL="http://localhost:8001"
```

To use a PostgreSQL instance instead of DynamoDB, provide a PostgreSQL connection URL instead, e.g.

```commandline
export DISEASE_NORM_DB_URL="postgresql://postgres@localhost:5432/disease_normalizer"
```
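
The configuration rules above can be summarized in a short Python sketch. It is not the library's actual implementation: the function names are hypothetical, and the precedence of the command-line value over the environment variable is an assumption that the text does not state.

```python
import os
from typing import Mapping, Optional
from urllib.parse import urlparse

# Default DynamoDB endpoint named in the text above.
DEFAULT_DB_URL = "http://localhost:8000"


def resolve_db_url(
    cli_value: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Pick a database URL from a CLI value, the environment, or the default.

    Assumption: an explicit --db_url value takes priority over
    DISEASE_NORM_DB_URL. The source text does not specify the ordering.
    """
    env = os.environ if env is None else env
    return cli_value or env.get("DISEASE_NORM_DB_URL") or DEFAULT_DB_URL


def backend_for(url: str) -> str:
    """Guess the storage backend from the URL scheme (illustrative only)."""
    scheme = urlparse(url).scheme
    return "postgresql" if scheme in ("postgresql", "postgres") else "dynamodb"
```

With no option and no environment variable set, `resolve_db_url()` falls back to `http://localhost:8000`, and a `postgresql://` URL is classified as the PostgreSQL backend.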

### Adding and refreshing data

Use the `disease_norm_update` command in a shell to update the database.

#### Update source(s)

The Disease Normalizer currently uses data from the following sources:

 * The [National Cancer Institute Thesaurus (NCIt)](https://ncithesaurus.nci.nih.gov/ncitbrowser/)
 * The [Mondo Disease Ontology](https://mondo.monarchinitiative.org/)
 * The [Online Mendelian Inheritance in Man (OMIM)](https://www.omim.org/)
 * [OncoTree](http://oncotree.mskcc.org/)
 * The [Disease Ontology](https://disease-ontology.org/)

As described above, all source data other than OMIM can be acquired automatically.

To update one source, simply set `--normalizer` to the source you wish to update. The normalizer will check to see if local source data is up-to-date, acquire the most recent data if not, and use it to populate the database.

For example, run the following to acquire the latest NCIt data if necessary, and update the NCIt disease records in the normalizer database:

```commandline
disease_norm_update --normalizer="ncit"
```

To update multiple sources, you can use the `--normalizer` option with the source names separated by spaces.
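
As a rough sketch, a script could assemble such a command as follows. The helper `build_update_command` is hypothetical, and the source name `mondo` is an assumption (only `ncit` appears in the text above), so check the accepted names before relying on it.

```python
import shlex
from typing import List, Sequence


def build_update_command(sources: Sequence[str]) -> List[str]:
    """Build an argv list that updates the given sources in one call.

    Source names are joined with spaces, as the text above describes.
    """
    if not sources:
        raise ValueError("at least one source name is required")
    return ["disease_norm_update", f"--normalizer={' '.join(sources)}"]


# "mondo" is an assumed source name, used here for illustration only.
cmd = build_update_command(["ncit", "mondo"])

# shlex.join quotes the space-separated value so it is safe to paste into a shell.
print(shlex.join(cmd))
```

The resulting list can be passed directly to `subprocess.run(cmd, check=True)`, which avoids shell quoting issues altogether.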

#### Update all sources

To update all sources, use the `--update_all` flag:

```commandline
disease_norm_update --update_all
```

### Create Merged Concept Groups
The `normalize` endpoint relies on merged concept groups.

To create merged concept groups, use the `--update_merged` flag with the `--update_all` flag.

```commandline
python3 -m disease.cli --update_all --update_merged
```

### Starting the disease normalization service

Once the Disease Normalizer database has been loaded, from the project root, run the following:

```commandline
uvicorn disease.main:app --reload
```

Next, view the OpenAPI docs on your local machine:

http://127.0.0.1:8000/disease
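
Once the service is running, a client can build request URLs against it. The sketch below is hedged: the `/disease/normalize` path and the `q` query parameter are assumptions based on the `normalize` endpoint mentioned earlier, so confirm both in the OpenAPI docs before use. The function name `normalize_url` is hypothetical.

```python
from urllib.parse import urlencode

# Base URL of the locally running service, as given above.
BASE_URL = "http://127.0.0.1:8000/disease"


def normalize_url(term: str, base_url: str = BASE_URL) -> str:
    """Build a URL for a normalize query.

    Assumptions: the endpoint lives at <base_url>/normalize and takes the
    search term in a query parameter named "q".
    """
    return f"{base_url}/normalize?{urlencode({'q': term})}"


print(normalize_url("non-small cell lung carcinoma"))
```

The printed URL can then be fetched with any HTTP client, for example `requests.get(...)` (already part of the `etl` dependency group).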

## Developer instructions

The following sections include instructions specifically for developers.

### Installation
For a development install, we recommend using Pipenv. See the
[pipenv docs](https://pipenv-fork.readthedocs.io/en/latest/#install-pipenv-today)
for directions on installing Pipenv in your environment.

To get started, clone the repo and initialize the environment:

```commandline
git clone https://github.com/cancervariants/disease-normalization
cd disease-normalization
pipenv shell
pipenv update
pipenv install --dev
```

Alternatively, install the `pg`, `etl`, `dev`, and `test` dependency groups in a virtual environment:

```commandline
git clone https://github.com/cancervariants/disease-normalization
cd disease-normalization
python3 -m virtualenv venv
source venv/bin/activate
pip install -e ".[pg,etl,dev,test]"
```

### Init coding style tests

Code style is managed by [flake8](https://github.com/PyCQA/flake8) and checked prior to commit.

We use [pre-commit](https://pre-commit.com/#usage) to run conformance tests.

The pre-commit hooks perform the following checks:

* Check code style
* Check for added large files
* Detect AWS credentials
* Detect private keys

Before your first commit, run:

```commandline
pre-commit install
```

### Running unit tests

Tests are provided via pytest.

```commandline
pytest
```

By default, tests will employ an existing DynamoDB database. For test environments where this is unavailable (e.g. in CI), the `DISEASE_TEST` environment variable can be set to initialize a local DynamoDB instance with miniature versions of input data files before tests are executed.

```commandline
export DISEASE_TEST=true
pytest
```

Sometimes, sources will update their data, and our test fixtures and data will become incorrect. The `tests/scripts/` subdirectory includes scripts to rebuild data files, although most fixtures will need to be updated manually.
