Metadata-Version: 2.4
Name: clinvarbitration
Version: 2.2.8
Summary: CPG ClinVar Re-interpretation
Project-URL: Repository, https://github.com/populationgenomics/ClinvArbitration
License: MIT License
        
        Copyright (c) 2022 Centre for Population Genomics
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
License-File: LICENSE
Classifier: Environment :: Console
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: English
Classifier: Operating System :: MacOS :: MacOS X
Classifier: Operating System :: POSIX
Classifier: Operating System :: Unix
Classifier: Programming Language :: Python
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: <3.12,>=3.10
Requires-Dist: cpg-flow~=1.2
Requires-Dist: pyspark==3.5.3
Provides-Extra: test
Requires-Dist: bump-my-version; extra == 'test'
Requires-Dist: pre-commit; extra == 'test'
Requires-Dist: pytest; extra == 'test'
Requires-Dist: pytest-xdist>=3.6.0; extra == 'test'
Description-Content-Type: text/markdown

# ClinVar, re-summarised

## Motivation

[Talos](https://www.github.com/populationgenomics/automated-interpretation-pipeline), a tool for identifying clinically relevant variants in large cohorts, uses [ClinVar](https://www.ncbi.nlm.nih.gov/clinvar/) ratings as a contributing factor in determining pathogenicity. While developing Talos we found that the default aggregate summaries generated by ClinVar are highly conservative; see the [table here](https://www.ncbi.nlm.nih.gov/clinvar/docs/clinsig/#agg_germline) describing the aggregate classification logic.

## Content

This repository contains an alternative algorithm ([described here](docs/algorithm.md)) for re-aggregating the individual ClinVar submissions, generating decisions which favour clear assignment of pathogenic/benign ratings instead of defaulting to 'conflicting'. These ratings are not intended as a replacement for ClinVar's own decisions, but may provide value by showing that, although conflicting submissions exist, there is a clear bias towards either benign or pathogenic ratings.

We aim to re-run this process monthly, and publish the resulting files on Zenodo. You can download this pre-generated bundle here: https://zenodo.org/records/16792026

## Primary Outputs

* Hail Table and TSV of all revised decisions
* Hail Table and TSV of all Pathogenic missense changes, indexed on Transcript and Codon. This is usable as a PM5 annotation resource.

### TSVs

1. `clinvar_decisions.tsv`: A tab-separated file with headers, containing our re-summarised ClinVar decisions. Columns:
   - `contig`: the chromosome or contig of the variant
   - `position`: the position of the variant on the contig
   - `reference`: the reference allele at the variant position
   - `alternate`: the alternate allele at the variant position
   - `clinical_significance`: the clinical significance of the variant, as determined by our algorithm
   - `gold_stars`: the number of gold stars assigned to the variant, indicating the quality of the evidence supporting the asserted significance
   - `allele_id`: the unique identifier for the variant in ClinVar, accessible directly via URL like `http://www.ncbi.nlm.nih.gov/clinvar?term=XXXXXXX[alleleid]`, or through ClinVar's web page using an 'advanced search' field

2. `clinvar_decisions.pm5.tsv`: A tab-separated file with headers, containing our PM5 missense decisions. All ClinVar entries in this file are Pathogenic Missense changes. Columns:
   - `transcript`: the transcript ID of the gene in which the missense change occurs
   - `codon`: the codon position of the missense change in that transcript
   - `clinvar_alleles`: a `+`-delimited string in which each entry is an `AlleleID::GoldStars` pair, where `AlleleID` is the unique identifier for the ClinVar allele, and `GoldStars` is the number of stars assigned to that allele. For example, `12345::3+67890::1` indicates that allele `12345` has 3 stars and allele `67890` has 1 star, and that both affect the same codon in the same transcript.
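
As a minimal sketch of consuming these files, the Python below reads rows by header name, splits a `clinvar_alleles` value into (allele ID, stars) pairs, and builds the ClinVar lookup URL for an allele ID. The function names are illustrative and are not part of the ClinvArbitration package; the column names, delimiters, and URL pattern are taken from the descriptions above, and the example row is made up rather than a real ClinVar record.

```python
import csv
import io
from typing import Iterator, TextIO


def read_decisions(handle: TextIO) -> Iterator[dict[str, str]]:
    """Yield one dict per row of a tab-separated file with a header line."""
    yield from csv.DictReader(handle, delimiter='\t')


def parse_clinvar_alleles(value: str) -> list[tuple[str, int]]:
    """Split 'AlleleID::GoldStars+AlleleID::GoldStars' into (allele_id, stars) pairs."""
    pairs = []
    for entry in value.split('+'):
        allele_id, stars = entry.split('::')
        pairs.append((allele_id, int(stars)))
    return pairs


def clinvar_url(allele_id: str) -> str:
    """Build the direct ClinVar search URL for an allele ID, following the pattern above."""
    return f'http://www.ncbi.nlm.nih.gov/clinvar?term={allele_id}[alleleid]'


if __name__ == '__main__':
    # Illustrative content only: the values below are invented for this sketch
    example = io.StringIO(
        'contig\tposition\treference\talternate\tclinical_significance\tgold_stars\tallele_id\n'
        'chr1\t12345\tA\tG\tPathogenic\t2\t99999\n',
    )
    for row in read_decisions(example):
        print(row['contig'], row['position'], row['clinical_significance'], clinvar_url(row['allele_id']))

    # The example string from the column description above
    print(parse_clinvar_alleles('12345::3+67890::1'))  # [('12345', 3), ('67890', 1)]
```

To read a downloaded file instead of the in-memory example, pass an open file handle, e.g. `open('clinvar_decisions.tsv', newline='')`, to `read_decisions`.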

## Usage

### Download Results

We aim to generate data monthly, and publish the results on Zenodo. The latest version of the data can be found at:

> https://zenodo.org/records/16777475

### Local Running

#### Downloading input files

A Nextflow workflow is provided to run the ClinvArbitration process locally. To use this process you will need reference files:

- a reference genome, in FASTA format
- a GFF3 file, containing gene annotations for the reference genome
- the files containing raw ClinVar submissions and variant details

A directory ([data](data)) and a script ([download_data.sh](data/download_files.sh)) are provided to download and store the required files. Running this script from the `data` directory will download and unpack all required files. The location these files are downloaded to matches the expected location in the Nextflow config, so you can run the workflow immediately after downloading.

The ClinVar Variant and Submission summary files are updated weekly. You should delete your local copy and re-download each time you run this workflow, to ensure you're capturing the latest data.
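
One rough way to decide whether a local copy is due for replacement is to check its modification time against the weekly update cycle. The sketch below is an assumption-laden heuristic, not part of this repository: `is_stale` and the example path are hypothetical, and a file's modification time only approximates when it was downloaded.

```python
import time
from pathlib import Path


def is_stale(path: Path, max_age_days: float = 7.0) -> bool:
    """Return True if the file is missing or was last modified more than max_age_days ago."""
    if not path.exists():
        return True
    age_seconds = time.time() - path.stat().st_mtime
    return age_seconds > max_age_days * 24 * 60 * 60


if __name__ == '__main__':
    # Hypothetical path: substitute the file names the download script actually creates
    candidate = Path('data') / 'example_clinvar_file.txt.gz'
    if is_stale(candidate):
        print(f'{candidate} is missing or older than a week; re-run the download script')
    else:
        print(f'{candidate} looks recent enough')
```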

#### Running the workflow

The ClinvArbitration workflow can be run containerised, or locally. By default, the reference data will be read from a directory called `data`, and the outputs written to a directory `nextflow_outputs`.

Local execution requires:

- a Nextflow installation, to operate the workflow
- a Python environment, with the ClinvArbitration package and its dependencies installed
  - this can be installed with `pip install .` from the root of this repository
- BCFtools, to annotate the ClinVar variants with gene information

```bash
nextflow -c nextflow/nextflow.config \
    run nextflow/clinvarbitration.nf
```

A containerised execution requires:

- a Nextflow installation, to operate the workflow
- a Docker installation, to run the workflow in a container

Step 1: build the Docker image:

```bash
docker build -t clinvarbitration:local .
```

Step 2: run the workflow using the Docker image:

```bash
nextflow -c nextflow/nextflow.config \
    run nextflow/clinvarbitration.nf \
    -with-docker clinvarbitration:local
```

## CPG-Flow

Internally at CPG, this workflow is run using [CPG-Flow](https://github.com/populationgenomics/cpg-flow), an in-house Hail Batch based workflow executor. The following elements relate to that workflow:

* an [example config file](src/clinvarbitration/config_template.toml), with enough entries populated that a standard CPG user could dry-run the workflow locally
* a [workflow runner script](src/clinvarbitration/run_workflow.py)
* a definition of all [workflow stages](src/clinvarbitration/stages.py)

The intention is that once an image has been built from the Dockerfile within this repository, this workflow can be triggered like so:

```bash
analysis-runner \
    --skip-repo-checkout \
    --image australia-southeast1-docker.pkg.dev/cpg-common/images-dev/clinvarbitration:PR_24 \
    --config new_clinvarbitration.toml \
    --dataset seqr \
    --description 'resummarise_clinvar' \
    -o resummarise_clinvar \
    --access-level test \
    run_workflow
```

A config file is required containing a few entries, some relating to this workflow specifically, some relating to cpg-flow setup:

* `workflow.driver_image`: populated by analysis-runner, points to _this_ docker image
* `site_blacklist`: list of ClinVar submitters to ignore. Useful in removing noise, or blinding to _self_ submissions
* `ref_fasta`: required to run bcftools csq. Must match the `genome_build`
* `genome_build`: used to decide whether ClinVar/Annotation is sourced using GRCh37 or GRCh38 (default)

## Acknowledgements

* [ClinVar](https://www.ncbi.nlm.nih.gov/clinvar), for providing the data which this process is based on
