Metadata-Version: 2.1
Name: psite-annotation
Version: 0.3.0
Summary: Module for annotating p-sites based on resources such as PhosphoSitePlus
License: Apache-2.0
Author: Matthew The
Author-email: matthew.the@tum.de
Requires-Python: >=3.8.1,<4.0
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Requires-Dist: config-path (==1.0.3)
Requires-Dist: numpy (>=1.18.1,<2.0.0)
Requires-Dist: pandas (>=1.3.0,<2.0.0)
Description-Content-Type: text/markdown

# Psite annotation

Python module for annotating phosphosites in a pandas dataframe with information such as PhosphoSitePlus annotations, kinase-substrate relations and domain information.

## Installation

You can install this package with pip:

```
pip install psite-annotation
```

Otherwise, you can clone this repository and install it with pip manually:

```
git clone https://www.github.com/kusterlab/psite_annotation.git
cd psite_annotation
pip install .
```

### Installing annotation files

The easiest way to use the package is to supply the annotation files each time you call the respective functions.
Alternatively, it is also possible to install a configuration file to automatically point to the annotation files.

To set the paths to the annotation files, create a `config.json` file with the following content:

```
{
    "domainMappingFile": "/path/to/uniprot_to_domain.csv",
    "inVitroKinaseSubstrateMappingFile": "/path/to/yasushi_supp_table2_kinase_substrate_relations_mapped_ids.tsv",
    "motifsFile": "/path/to/motifs_all.tsv",
    "turnoverFile": "/path/to/TurnoverSites.csv",
    "pspFastaFile": "/path/to/Phosphosite_seq.fasta",
    "pspKinaseSubstrateFile": "/path/to/Kinase_Substrate_Dataset",
    "pspAnnotationFile": "/path/to/Phosphorylation_site_dataset",
    "pspRegulatoryFile": "/path/to/Regulatory_sites"
}
```

Replace `/path/to` with the absolute path to the directory containing the annotation files.
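If all annotation files sit in one directory, the config file can also be generated instead of written by hand. The following is a minimal sketch using only the Python standard library. The helper name `write_config` and its arguments are illustrative assumptions and not part of `psite_annotation`; the keys and file names are copied from the example above.

```python
import json
from pathlib import Path

# Keys and file names copied from the example config above.
ANNOTATION_FILES = {
    "domainMappingFile": "uniprot_to_domain.csv",
    "inVitroKinaseSubstrateMappingFile": "yasushi_supp_table2_kinase_substrate_relations_mapped_ids.tsv",
    "motifsFile": "motifs_all.tsv",
    "turnoverFile": "TurnoverSites.csv",
    "pspFastaFile": "Phosphosite_seq.fasta",
    "pspKinaseSubstrateFile": "Kinase_Substrate_Dataset",
    "pspAnnotationFile": "Phosphorylation_site_dataset",
    "pspRegulatoryFile": "Regulatory_sites",
}


def write_config(annotation_dir, config_path="config.json"):
    """Hypothetical helper: write a config file with absolute paths."""
    base = Path(annotation_dir).resolve()
    config = {key: str(base / name) for key, name in ANNOTATION_FILES.items()}
    Path(config_path).write_text(json.dumps(config, indent=4))
    return config
```

Calling `write_config("/data/annotations")` would write a `config.json` in the current directory, where `/data/annotations` is a placeholder for your own directory.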

For Kusterlab internal users:

- ask Matthew for the config file.

For external users:

- PhosphoSitePlus annotation files can be downloaded from https://www.phosphosite.org/staticDownloads.action (account needed).
- The other annotation files are available from the `annotations.zip` file in this repository.

You can then install this config file with:

```
python -c "import psite_annotation.config as c; c.setUserConfig('./config.json')"
```

## Usage

To add upstream kinases to a pandas dataframe `df` with columns `Proteins` (UniProt identifiers separated by semicolons, e.g. `Q86U42-2;Q86U42`) and `Modified sequence` (standard MaxQuant notation, e.g. `(ac)AAAAAAAAAAGAAGGRGS(ph)GPGR`):

```
import psite_annotation as pa

df = pa.addPeptideAndPsitePositions(df, pa.pspFastaFile, pspInput = True)
df = pa.addPSPKinaseSubstrateAnnotations(df, pa.pspKinaseSubstrateFile)
```

We first need to annotate the peptide and modification positions within the protein using `addPeptideAndPsitePositions()`. This adds a column with an identifier for the phosphosite, which can then be mapped to the phosphorylating kinase using `addPSPKinaseSubstrateAnnotations()`.

## Functions

### pa.addPeptideAndPsitePositions()

```
# input: pandas dataframe with 'Proteins' (usually the UniProt ID) and 'Modified sequence' columns
# output: pandas dataframe with the following added columns:
#   'Start positions' = starting positions of the modified peptide in the protein sequence (1-based, methionine is counted). If multiple isoforms/proteins contain the sequence, the starting positions are separated by semicolons in the same order as they are listed in the 'Proteins' input column
#   'End positions' = end positions of the modified peptide in the protein sequence (see above for details)
#   'Site sequence context' = +/- 15 amino acids around each of the modified sites, separated by semicolons
#   'Site positions' = position of the modification (see 'Start positions' above for details on how the position is counted)
```

Usage:

```
df = pa.addPeptideAndPsitePositions(df, fastaFile)
```
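
To make the position conventions concrete, here is a simplified sketch of how 1-based start, end and site positions can be derived for a single peptide and a single protein sequence. It is an illustrative re-implementation using only the standard library, not the package's own code: the function name `peptide_and_site_positions` is hypothetical, it returns plain integers rather than the package's column format, and it simply truncates the sequence context at the protein termini.

```python
import re


def peptide_and_site_positions(modified_sequence, protein_sequence, context=15):
    """Return (start, end, [(site_position, site_context), ...]) or None.

    Positions are 1-based and count the initiator methionine.
    """
    bare = []          # residues of the unmodified peptide
    site_offsets = []  # 0-based offsets of residues followed by "(ph)"
    # Tokens are either a modification in parentheses or a single residue.
    for token in re.finditer(r"\(([^)]*)\)|([A-Z])", modified_sequence):
        modification, residue = token.groups()
        if residue:
            bare.append(residue)
        elif modification == "ph" and bare:
            site_offsets.append(len(bare) - 1)

    peptide = "".join(bare)
    index = protein_sequence.find(peptide)
    if index < 0:
        return None  # peptide not present in this protein

    start = index + 1
    end = index + len(peptide)
    sites = []
    for offset in site_offsets:
        pos0 = index + offset  # 0-based position in the protein
        lower = max(0, pos0 - context)
        sites.append((pos0 + 1, protein_sequence[lower:pos0 + context + 1]))
    return start, end, sites
```

For a toy protein `MAAGSPEPTSIDEKGGR` and the modified sequence `PEPTS(ph)IDEK`, the peptide spans positions 6 to 14 and the phosphorylated serine is at position 10.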

### pa.addPSPAnnotations()

```
# input: pandas dataframe with 'Site positions' column (this column can be obtained from the addPeptideAndPsitePositions() function)
# output: pandas dataframe with the following added columns:
#   PSP_LT_LIT = number of low-throughput studies
#   PSP_MS_LIT = number of high-throughput Mass Spec studies
#   PSP_MS_CST = number of high-throughput Mass Spec studies by Cell Signaling Technology
```

Usage:

```
df = pa.addPeptideAndPsitePositions(df, pa.pspFastaFile, pspInput = True)
df = pa.addPSPAnnotations(df, pa.pspAnnotationFile)
```

### pa.addPSPRegulatoryAnnotations()

```
# input: pandas dataframe with 'Site positions' column (this column can be obtained from the addPeptideAndPsitePositions() function)
# output: pandas dataframe with the following added columns:
#   PSP_ON_FUNCTION = functional annotations for downstream regulation
#   PSP_ON_PROCESS = process annotations for downstream regulation
#   PSP_ON_PROT_INTERACT = protein interactions
#   PSP_ON_OTHER_INTERACT = other interactions
#   PSP_NOTES = regulatory site notes
```

Usage:

```
df = pa.addPeptideAndPsitePositions(df, pa.pspFastaFile, pspInput = True)
df = pa.addPSPRegulatoryAnnotations(df, pa.pspRegulatoryFile)
```

### pa.addPSPKinaseSubstrateAnnotations()

```
# input: pandas dataframe with 'Site positions' column (this column can be obtained from the addPeptideAndPsitePositions() function)
# output: pandas dataframe with the following added columns:
#   'PSP Kinases' = all phosphorylating kinases according to PhosphoSitePlus, no distinction is made between in vivo and in vitro evidence (this can be added in the future, if necessary)
```

Usage:

```
df = pa.addPeptideAndPsitePositions(df, pa.pspFastaFile, pspInput = True)
df = pa.addPSPKinaseSubstrateAnnotations(df, pa.pspKinaseSubstrateFile)
```

### pa.addDomains()

```
# input: pandas dataframe with 'Proteins', 'Start positions' and 'End positions' columns (the latter two are obtained from the addPeptideAndPsitePositions() function)
# output: pandas dataframe with the following added columns:
#   Domains = domains that overlap with the modified peptide sequence
```

Usage:

```
df = pa.addPeptideAndPsitePositions(df, fastaFile)
df = pa.addDomains(df, pa.domainMappingFile)
```
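
The overlap criterion can be sketched as a plain interval test. This is only an illustration under stated assumptions: the function name `overlapping_domains` and the `(name, start, end)` tuples are hypothetical, and the actual layout of the domain mapping file is not described here.

```python
def overlapping_domains(start, end, domains):
    """Return names of domains overlapping the peptide at [start, end].

    `domains` is an iterable of (name, domain_start, domain_end) tuples,
    with 1-based inclusive coordinates (an assumed format).
    """
    return [
        name
        for name, domain_start, domain_end in domains
        # Two closed intervals overlap if each starts before the other ends.
        if domain_start <= end and start <= domain_end
    ]
```

A peptide counts as overlapping even if only a single residue falls inside the domain.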

### pa.addMotifs()

```
# input: pandas dataframe with 'Site sequence context' column (this column can be obtained from the addPeptideAndPsitePositions() function)
# output: pandas dataframe with the following added columns:
#   Motifs = matching motif identifiers for all of the modified sites
```

Usage:

```
df = pa.addPeptideAndPsitePositions(df, fastaFile)
df = pa.addMotifs(df, pa.motifsFile)
```
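
As a rough illustration of motif matching against a site sequence context, the sketch below tests regular expressions anchored at the modified residue. Everything in it is an assumption made for illustration: the function name `matching_motifs`, the `(pattern, offset)` representation and the two example motifs are hypothetical and do not reflect the format of the motifs file or the package's matching logic.

```python
import re


def matching_motifs(site_context, motifs, center=15):
    """Return identifiers of motifs that match around the modified residue.

    `site_context` has the modified residue at index `center`.
    `motifs` maps an identifier to (regex, offset), where `offset` is the
    index of the modified residue within the regex (assumed format).
    """
    hits = []
    for name, (pattern, offset) in motifs.items():
        start = center - offset
        if start < 0:
            continue  # motif would extend past the start of the context
        # Pattern.match with a position anchors the match at that index.
        if re.compile(pattern).match(site_context, start):
            hits.append(name)
    return hits


# Two well-known motif classes, written here as simple regexes.
EXAMPLE_MOTIFS = {
    "proline-directed": ("[ST]P", 0),
    "basophilic R-x-x-S/T": ("R..[ST]", 3),
}
```

With the default `center=15`, the function expects the +/- 15 amino acid context described above, with the modified residue in the middle.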

### pa.addInVitroKinases()

```
# input: pandas dataframe with 'Site positions' column (this column can be obtained from the addPeptideAndPsitePositions() function)
# output: pandas dataframe with the following added columns:
#   'In Vitro Kinases' = all phosphorylating kinases according to the Yasushi in vitro kinase-substrate study
```

Usage:

```
df = pa.addPeptideAndPsitePositions(df, fastaFile)
df = pa.addInVitroKinases(df, pa.inVitroKinaseSubstrateMappingFile)
```

### pa.addTurnoverRates()

```
# input: pandas dataframe with 'Modified sequence' column
# output: pandas dataframe with the following added columns:
#   'PTM Turnover' = rate of turnover for the modification sites according to Jana's PTM Turnover data, e.g. slower, faster
```

Usage:

```
df = pa.addTurnoverRates(df, pa.turnoverFile)
```

