Metadata-Version: 2.1
Name: kmerpapa
Version: 0.2.2
Summary: Tool to calculate a k-mer pattern partition from position specific k-mer counts.
Home-page: https://github.com/besenbacherLab/kmerpapa
License: MIT
Author: Søren Besenbacher
Author-email: besenbacher@clin.au.dk
Requires-Python: >=3.7,<3.10
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Requires-Dist: numba (>=0.54.1,<0.55.0)
Requires-Dist: numpy (>=1.17,<1.21)
Requires-Dist: scipy (>=1.7.1,<2.0.0)
Project-URL: Repository, https://github.com/besenbacherLab/kmerpapa
Description-Content-Type: text/markdown

# kmerPaPa
Tool to calculate a "k-mer pattern partition" from position-specific k-mer counts. This can be used, for instance, to train a mutation rate model.

## Requirements
kmerPaPa requires Python 3.7, 3.8 or 3.9.

## Installation
kmerPaPa can be installed using pip:
```
pip install kmerpapa
```
or using [pipx](https://pypa.github.io/pipx/):
```
pipx install kmerpapa
```

## Test data
The test data files used in the usage examples below can be downloaded from the test_data directory in the project's GitHub repository:
```
wget https://github.com/BesenbacherLab/kmerPaPa/raw/main/test_data/mutated_5mers.txt
wget https://github.com/BesenbacherLab/kmerPaPa/raw/main/test_data/background_5mers.txt
```

## Usage
To train a mutation rate model, the input data should specify the number of times each k-mer is observed mutated and unmutated. One option is to provide one file with the mutated k-mer counts (positive) and one file with the k-mer counts for the whole genome (background). We can then run kmerpapa like this:
```
kmerpapa --positive mutated_5mers.txt \
         --background background_5mers.txt \
         --penalty_values 3 5 7
```
The above command first uses cross validation to find the best penalty value among 3, 5 and 7. It then finds the optimal k-mer pattern partition using that penalty value.
If both a list of penalty values and a list of pseudo-counts are specified then all combinations of values will be tested during cross validation:
```
kmerpapa --positive mutated_5mers.txt \
         --background background_5mers.txt \
         --penalty_values 3 5 6 \
         --pseudo_counts 0.5 1 10
```
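To make the size of that grid concrete, here is a minimal sketch that only enumerates the combinations the command above would test. It does not call kmerpapa, and the variable names `penalties`, `pseudo_counts` and `n` are purely illustrative:

```shell
# Values taken from the example command above
penalties="3 5 6"
pseudo_counts="0.5 1 10"

n=0
for c in $penalties; do
    for a in $pseudo_counts; do
        # One line per (penalty, pseudo-count) pair in the grid
        echo "penalty=$c pseudo_count=$a"
        n=$((n + 1))
    done
done
echo "combinations: $n"
```

With three penalty values and three pseudo-counts, the grid contains nine combinations, so the amount of cross validation work grows with the product of the two list lengths.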
If only a single combination of penalty_value and pseudo_count is provided, the default is not to run cross validation unless the "--n_folds" or "--CV_only" option is used. The "--CV_only" option can be combined with the "--CVfile" option to parallelize a grid search.
For example, using bash:
```
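for c in 3 5 6; do
    for a in 0.5 1 10; do
        # Run each (penalty, pseudo-count) combination as a background job
        kmerpapa --positive mutated_5mers.txt \
                 --background background_5mers.txt \
                 --penalty_values "$c" \
                 --pseudo_counts "$a" \
                 --CV_only --CVfile "CV_results_c${c}_a${a}.txt" &
    done
done
wait  # block until all background jobs have finished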
```

