Metadata-Version: 2.4
Name: epilink
Version: 0.1.3
Summary: Epidemiological linkage inference from temporal and genetic data with an E/P/I infectiousness model.
Project-URL: Homepage, https://github.com/ydnkka/epilink
Project-URL: Issues, https://github.com/ydnkka/epilink/issues
Project-URL: Documentation, https://github.com/ydnkka/epilink
Project-URL: Repository, https://github.com/ydnkka/epilink
Project-URL: Bug Tracker, https://github.com/ydnkka/epilink/issues
Project-URL: Release Notes, https://github.com/ydnkka/epilink/releases
Author-email: Dominic Arthur <arthurdominic04@gmail.com>
Maintainer-email: Dominic Arthur <arthurdominic04@gmail.com>
License: MIT
License-File: LICENSE
Keywords: epidemiology,genomics,infectious-disease,linkage-inference,sars-cov-2,transmission
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: English
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3.10
Requires-Dist: networkx>=3.0
Requires-Dist: numpy>=2.0
Requires-Dist: pandas>=2.0
Requires-Dist: scipy>=1.14
Provides-Extra: dev
Requires-Dist: black>=24.1; extra == 'dev'
Requires-Dist: mypy>=1.4; extra == 'dev'
Requires-Dist: pre-commit>=3.3; extra == 'dev'
Requires-Dist: pytest-cov>=4.0; extra == 'dev'
Requires-Dist: pytest>=7.3; extra == 'dev'
Requires-Dist: ruff>=0.5; extra == 'dev'
Description-Content-Type: text/markdown

# EpiLink

EpiLink scores how compatible a pair of samples is with recent transmission scenarios using sampling-time differences and consensus genetic distance.

It is useful when you have:

- a sampling-time difference in days
- a consensus genetic distance in mutations
- a question like "is this pair more compatible with direct transmission or a recent shared ancestor?"

EpiLink returns per-scenario compatibility scores and can also sum scores across a user-defined target subset such as `["ad(0)", "ca(0,0)"]`.

## Installation

Clone the repository first if you are starting from GitHub:

```bash
git clone https://github.com/ydnkka/epilink.git
cd epilink
```

The repository environment is the easiest way to get everything needed for the package, examples, and simulation helpers:

```bash
conda env create -f environment.yml
conda activate epilink
```

If you prefer `pip`, an editable install pulls in the declared dependencies (NumPy, SciPy, pandas, and NetworkX) automatically:

```bash
python -m pip install -e .
```

EpiLink requires Python 3.10 or newer.

## Scenario labels

- `ad(0)`: direct ancestor-descendant transmission
- `ad(1)`: ancestor-descendant transmission with one hidden intermediate
- `ca(0,0)`: a recent shared common ancestor with one branch to each sampled case
- `ca(m_i,m_j)`: a common-ancestor scenario with `m_i` and `m_j` hidden generations on each branch

`maximum_depth` controls how many of these latent scenarios are generated.
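As a rough illustration of the label grammar above, the sketch below splits a label string into its kind and integer arguments. `parse_label` is a hypothetical helper written for this README, not part of the EpiLink API, and it assumes labels follow exactly the `ad(k)` and `ca(m_i,m_j)` forms listed here.

```python
import re

# Pattern for the two label families described above:
#   ad(k)        -> ancestor-descendant with k hidden intermediates
#   ca(m_i,m_j)  -> common ancestor with m_i and m_j hidden generations
_LABEL = re.compile(r"^(ad)\((\d+)\)$|^(ca)\((\d+),\s*(\d+)\)$")


def parse_label(label: str) -> tuple[str, tuple[int, ...]]:
    """Split a scenario label into its kind and integer arguments.

    Hypothetical helper for illustration; not part of the epilink API.
    """
    match = _LABEL.match(label.strip())
    if match is None:
        raise ValueError(f"unrecognised scenario label: {label!r}")
    if match.group(1):  # the "ad" alternative matched
        return "ad", (int(match.group(2)),)
    return "ca", (int(match.group(4)), int(match.group(5)))


print(parse_label("ad(1)"))
print(parse_label("ca(0,0)"))
```

A parser like this can be handy for grouping results by scenario kind or by total number of hidden generations.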

## Which method to use

- `score_pair(...)`: one observed pair, plus a full per-scenario breakdown
- `score_target(...)`: only the target score, for scalar or array inputs
- `pairwise_model(...)`: a cached scorer for repeatedly evaluating the same target subset

Each individual scenario compatibility lies in `[0, 1]`. If `target` contains multiple scenarios, `target_compatibility` is the sum across that subset, so it can be greater than `1`.
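To make the summation concrete, here is a minimal sketch that adds per-scenario compatibilities for a target subset. It assumes only the nested `scenario_scores` layout shown in the quick start; `sum_target` is a hypothetical helper, and the numbers in `example_result` are invented for illustration rather than produced by EpiLink.

```python
def sum_target(result: dict, target: list[str]) -> float:
    """Add up per-scenario compatibilities for the labels in `target`."""
    scores = result["scenario_scores"]
    return sum(scores[label]["compatibility"] for label in target)


# Hand-written stand-in for a score_pair result; the values are invented.
example_result = {
    "scenario_scores": {
        "ad(0)": {"compatibility": 0.75},
        "ad(1)": {"compatibility": 0.25},
        "ca(0,0)": {"compatibility": 0.5},
    }
}

total = sum_target(example_result, ["ad(0)", "ca(0,0)"])
print(total)  # 1.25: each term is within [0, 1], but the sum is not
```

Because the combined score is a sum rather than a probability, compare it across pairs scored with the same target subset.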

## Quick start

```python
from epilink import EpiLink, InfectiousnessToTransmission

profile = InfectiousnessToTransmission(rng_seed=2026)

model = EpiLink(
    transmission_profile=profile,
    maximum_depth=2,
    mc_samples=20000,
    target=["ad(0)", "ca(0,0)"],
    mutation_process="stochastic",
)

result = model.score_pair(
    sample_time_difference=3.0,
    genetic_distance=2.0,
)

print(result["target_labels"])
print(result["target_compatibility"])
print(result["scenario_scores"]["ad(0)"]["compatibility"])
```

## More examples

### Score only a target subset

Use `score_target` when you only care about the combined score:

```python
score = model.score_target(
    sample_time_difference=3.0,
    genetic_distance=2.0,
    target=["ad(0)", "ad(1)", "ca(0,0)"],
)

print(score)
```

### Use `Scenario` objects instead of strings

```python
from epilink import Scenario

score = model.score_target(
    sample_time_difference=3.0,
    genetic_distance=2.0,
    target=[
        Scenario(kind="ad", intermediates=0),
        Scenario(kind="ca", branch_to_i=0, branch_to_j=0),
    ],
)
```

### Score many pairs at once

`score_target` and the scorer returned by `pairwise_model` broadcast NumPy inputs, so you can score a whole grid or batch in one call:

```python
import numpy as np

pairwise = model.pairwise_model(target=["ad(0)", "ca(0,0)"])

time_differences = np.array([[0.0], [2.0], [4.0]])
genetic_distances = np.array([[0.0, 1.0, 2.0, 3.0]])

scores = pairwise(time_differences, genetic_distances)
print(scores.shape)  # (3, 4)
```
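The `(3, 4)` result shape follows from standard NumPy broadcasting of a column against a row. The sketch below reproduces the shape logic with NumPy alone, without calling EpiLink, so it shows only how the inputs pair up and not how scores are computed.

```python
import numpy as np

time_differences = np.array([[0.0], [2.0], [4.0]])    # shape (3, 1): a column
genetic_distances = np.array([[0.0, 1.0, 2.0, 3.0]])  # shape (1, 4): a row

# A column broadcast against a row gives a full grid.
grid_shape = np.broadcast_shapes(time_differences.shape, genetic_distances.shape)
print(grid_shape)  # (3, 4)

# Element [i, j] pairs time difference i with genetic distance j.
t_grid, d_grid = np.broadcast_arrays(time_differences, genetic_distances)
print(t_grid[2, 1], d_grid[2, 1])  # 4.0 1.0
```

To score a list of observed pairs instead of a grid, pass two 1-D arrays of equal length; broadcasting then matches them element by element.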

### Build a toy simulated pair table

The simulation helpers are useful for generating synthetic examples and benchmarking downstream workflows:

```python
import networkx as nx

from epilink import (
    build_pairwise_case_table,
    simulate_epidemic_dates,
    simulate_genomic_sequences,
)

tree = nx.DiGraph(
    [
        ("case-0", "case-1"),
        ("case-0", "case-2"),
    ]
)

dated_tree = simulate_epidemic_dates(profile, tree, fraction_sampled=1.0)
simulated = simulate_genomic_sequences(profile, dated_tree, genome_length=500)
pair_table = build_pairwise_case_table(simulated["packed"], dated_tree)

print(pair_table.head())
```
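For a tree this small, the true relationship of each pair can be read off by hand. The sketch below does so with the standard library only, writing the same toy tree as a plain edge list. `ancestors` and `true_scenario` are hypothetical helpers, and mapping the number of intervening cases onto the scenario labels is an assumption made for illustration, not a description of how EpiLink or its simulation helpers label pairs.

```python
from itertools import combinations

# Same toy tree as above, written as (infector, infectee) edges.
edges = [("case-0", "case-1"), ("case-0", "case-2")]
parent = {child: source for source, child in edges}


def ancestors(case: str) -> list[str]:
    """Return the chain [case, parent, grandparent, ...] up to the root."""
    chain = [case]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain


def true_scenario(case_i: str, case_j: str) -> str:
    """Label a pair by its position in the tree (illustrative, not epilink API)."""
    chain_i, chain_j = ancestors(case_i), ancestors(case_j)
    if case_i in chain_j:  # case_i is an ancestor of case_j
        return f"ad({chain_j.index(case_i) - 1})"
    if case_j in chain_i:  # case_j is an ancestor of case_i
        return f"ad({chain_i.index(case_j) - 1})"
    # Otherwise use the most recent common ancestor of the two cases.
    common = next(node for node in chain_i if node in chain_j)
    return f"ca({chain_i.index(common) - 1},{chain_j.index(common) - 1})"


cases = sorted({node for edge in edges for node in edge})
labels = {pair: true_scenario(*pair) for pair in combinations(cases, 2)}
print(labels)
```

Under these assumptions the two infector-infectee pairs come out as `ad(0)` and the sibling pair as `ca(0,0)`, which gives a ground truth to compare scores against.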

## Mutation models

- `mutation_process="deterministic"` compares the observation with expected mutation counts
- `mutation_process="stochastic"` compares the observation with Poisson mutation-count draws

The stochastic option is usually the better choice when you want mutation-count variability to be part of the score.
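The difference between the two reference quantities can be sketched with NumPy alone. This is not EpiLink's scoring rule: the mutation rate and elapsed time are hypothetical values chosen so the arithmetic is easy to follow, and the snippet only contrasts a single expected count with Poisson draws around the same mean.

```python
import numpy as np

rng = np.random.default_rng(2026)

# Hypothetical inputs: a per-day mutation rate and an elapsed evolutionary time.
rate_per_day = 0.25
elapsed_days = 8.0
expected_count = rate_per_day * elapsed_days  # 2.0 mutations on average

# Deterministic view: one fixed reference value per simulated history.
print(expected_count)

# Stochastic view: Poisson draws scattered around the same mean.
draws = rng.poisson(lam=expected_count, size=20_000)
print(draws.mean(), draws.var())  # both close to 2.0 for a Poisson distribution
print(np.unique(draws)[:5])       # counts are integers, including 0
```

With the deterministic reference, an observed distance of 0 or 4 is always two mutations away from the expectation; with Poisson draws, both outcomes occur with non-trivial frequency.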

## Background and model characteristics

- Manuscript: [docs/assets/epilink.pdf](docs/assets/epilink.pdf)
- Latent histories: [docs/assets/epilink_scenarios.svg](docs/assets/epilink_scenarios.svg)
- Workflow figure: [docs/assets/epilink_schematic.svg](docs/assets/epilink_schematic.svg)
- Notebook: [docs/epilink_characterisation.ipynb](docs/epilink_characterisation.ipynb)

## Citation

If you use EpiLink in research, please cite it using the metadata in [CITATION.cff](CITATION.cff). The underlying infectiousness model is described in:

1. Hart WS, Maini PK, Thompson RN. High infectiousness immediately before COVID-19 symptom onset highlights the importance of continued contact tracing. *eLife*. 2021;10:e65534. <https://doi.org/10.7554/eLife.65534>

## License

MIT. See [LICENSE](LICENSE).
