Metadata-Version: 2.4
Name: PyMSAStats
Version: 0.1.0
Summary: Pure-Python MSA summary statistics
Author: Naiel J
License-Expression: MIT
Project-URL: Homepage, https://github.com/naielj/PyMSAStats
Project-URL: Repository, https://github.com/naielj/PyMSAStats
Project-URL: Issues, https://github.com/naielj/PyMSAStats/issues
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Dynamic: license-file

# PyMSAStats (pure-Python `msastats`)

A pure-Python implementation of the `msastats` API used by MSACompare for computing
Multiple Sequence Alignment (MSA) summary statistics.

This package installs and imports as `msastats`:

```python
import msastats
```

## Usage

You can use either the high-level `msastats` functions (drop-in API), or the
lower-level `MsaStatsCalculator`.

### Installation

Install from PyPI:

```bash
pip install pymsastats
```

Or with `uv`:

```bash
uv add pymsastats
```

For development (editable install from source):

```bash
git clone https://github.com/naielj/PyMSAStats.git
cd PyMSAStats
uv sync
```

### Drop-in API (Recommended)

From a list of aligned sequences:

```python
import msastats

stats = msastats.calculate_msa_stats(["AA-A", "AA-A", "A--A"])
names = msastats.stats_names()

as_dict = dict(zip(names, stats))
print(as_dict["AVG_GAP_SIZE"])
```

From an aligned FASTA file:

```python
import msastats

stats = msastats.calculate_fasta_stats("path/to/alignment.fasta")
print(stats)
```

### Calculator API

From a list of aligned sequences:

```python
from msastats import MsaStatsCalculator, StatType

# 1. Initialize with a list of aligned sequences
sequences = [
    "AC--GT",
    "ACGTGT",
    "AC-TGT",
]
calculator = MsaStatsCalculator(sequences)

# 2. Compute the statistics
calculator.recompute_stats()

# 3. Access the statistics
print(f"MSA Length: {calculator.msa_length}")
print(f"Number of Sequences: {calculator.number_of_sequences}")
print(f"Total Gaps: {calculator.total_number_of_indels}")

# Or access stats by type
avg_gap_size = calculator.get_stat_by_type(StatType.AVG_GAP_SIZE)
print(f"Average Gap Size: {avg_gap_size:.2f}")

# Get all stats as a vector
stats_vector = calculator.get_stat_vec()
print(f"Stats Vector: {stats_vector}")
```

From an aligned FASTA file:

```python
from pathlib import Path
from msastats import MsaStatsCalculator, StatType

# 1. Create a dummy FASTA file
fasta_content = """>seq1
AC--GT--
>seq2
ACGTGT--
>seq3
AC-TGTAC
"""
fasta_path = Path("dummy.fasta")
fasta_path.write_text(fasta_content)

# 2. Initialize from the FASTA file
calculator = MsaStatsCalculator.from_fasta(fasta_path)

# 3. Compute the statistics
calculator.recompute_stats()

# 4. Access the statistics
print(f"MSA Length: {calculator.msa_length}")
print(f"Longest Sequence: {calculator.msa_longest_seq_length}")
print(f"Shortest Sequence: {calculator.msa_shortest_seq_length}")
print(f"Total Number of Gaps: {calculator.total_number_of_indels}")

# Clean up the dummy file
fasta_path.unlink()
```

## Statistics reference (27 metrics)

The 27 summary statistics implemented here are defined in:

> Wygoda E, Loewenthal G, Moshe A, Alburquerque M, Mayrose I, Pupko T. Statistical framework to determine indel-length distribution. *Bioinformatics*. 2024;40(2):btae043. <https://doi.org/10.1093/bioinformatics/btae043>

```bibtex
@article{wygoda2024indel,
    author = {Wygoda, Elya and Loewenthal, Gil and Moshe, Asher and Alburquerque, Michael and Mayrose, Itay and Pupko, Tal},
    title = {Statistical framework to determine indel-length distribution},
    journal = {Bioinformatics},
    volume = {40},
    number = {2},
    pages = {btae043},
    year = {2024},
    doi = {10.1093/bioinformatics/btae043}
}
```

For an illustrated reference with a worked example showing all 27 metric values on a single MSA, see
[docs/summary_statistics_reference.md](docs/summary_statistics_reference.md).

### Terminology

- **Gap character:** the implementation treats `-` as a gap.
- **All-gap column:** a column where *all* sequences have `-` at that position.
- **Gap run / indel (what “gap” means in most metrics):** a maximal contiguous run of `-` in a single sequence.
- **All-gap trimming (important):** before detecting gap runs, the algorithm removes all-gap columns. This matches the
  original C++ implementation and prevents all-gap columns from splitting/creating gap runs.
- **Unique gap interval:** gap runs are grouped by their `(start, end)` coordinates *in the trimmed alignment*. If
  multiple sequences have a gap run with the same `(start, end)`, that is **one** unique gap interval with:
  - `length = end - start + 1`
  - `count = number of sequences that have that exact interval`

### Returned order

`calculate_msa_stats` / `calculate_fasta_stats` return a list of 27 floats in the same order as
`msastats.stats_names()` (and the `StatType` enum).

### Metric definitions

#### Alignment and sequence lengths

- `MSA_LEN` (MSACompare: `LINE_LENGTH`): alignment length in columns (includes all-gap columns).
- `LONGEST_UNALIGNED_SEQ` (MSACompare: `LONGEST_UNALIGNED_SEQ_LENGTH`): max ungapped sequence length across sequences
  (`len(seq.replace('-', ''))`).
- `SHORTEST_UNALIGNED_SEQ` (MSACompare: `SHORTEST_UNALIGNED_SEQ_LENGTH`): min ungapped sequence length across sequences.

#### Gap-run totals (after all-gap trimming)

Let the set of unique gap intervals be `U`, and for each `u ∈ U`, let `u.length` be its length and `u.count` be how
many sequences contain that interval.

- `TOT_NUM_GAPS` (MSACompare: `TOTAL_GAPS`): total number of gap runs across sequences:
  `Σ_u u.count`
- `AVG_GAP_SIZE` (MSACompare: `AVG_LENGTH_OF_GAPS`): mean gap-run length across all sequences:
  `(Σ_u u.length · u.count) / TOT_NUM_GAPS` (0 if `TOT_NUM_GAPS == 0`)
- `NUM_GAPS_LEN_ONE` (MSACompare: `GAPS_OF_LENGTH_ONE`): `Σ_{u.length==1} u.count`
- `NUM_GAPS_LEN_TWO` (MSACompare: `GAPS_OF_LENGTH_TWO`): `Σ_{u.length==2} u.count`
- `NUM_GAPS_LEN_THREE` (MSACompare: `GAPS_OF_LENGTH_THREE`): `Σ_{u.length==3} u.count`
- `NUM_GAPS_LEN_AT_LEAST_FOUR` (MSACompare: `GAPS_LARGER_THAN_THREE`): `Σ_{u.length>=4} u.count`

#### Unique gap intervals (after all-gap trimming)

- `TOT_NUM_UNIQUE_GAPS` (MSACompare: `TOTAL_UNIQUE_GAPS`): number of unique gap intervals: `|U|`
- `AVG_UNIQUE_GAP_SIZE` (MSACompare: `AVG_SIZE_OF_UNIQUE_GAPS`): mean unique-gap length:
  `(Σ_u u.length) / TOT_NUM_UNIQUE_GAPS` (0 if `TOT_NUM_UNIQUE_GAPS == 0`)

#### Unique gap intervals shared by k sequences (after all-gap trimming)

These count **unique** gap intervals (not total occurrences). For a given interval length bucket, they count how many
intervals have `u.count == k`.

Length 1:
- `NUM_GAPS_LEN_ONE_IN_ONE_SEQ` (MSACompare: `GAPS_LENGTH_ONE_ONE_SEQ`): `|{u: u.length==1 and u.count==1}|`
- `NUM_GAPS_LEN_ONE_IN_TWO_SEQS` (MSACompare: `GAPS_LENGTH_ONE_TWO_SEQ`): `|{u: u.length==1 and u.count==2}|`
- `NUM_GAPS_LEN_ONE_IN_ALL_EXCEPT_ONE` (MSACompare: `GAPS_LENGTH_ONE_EXCEPT_ONE`):
  `|{u: u.length==1 and u.count==N-1}|`

Length 2:
- `NUM_GAPS_LEN_TWO_IN_ONE_SEQ` (MSACompare: `GAPS_LENGTH_TWO_ONE_SEQ`): `|{u: u.length==2 and u.count==1}|`
- `NUM_GAPS_LEN_TWO_IN_TWO_SEQS` (MSACompare: `GAPS_LENGTH_TWO_TWO_SEQ`): `|{u: u.length==2 and u.count==2}|`
- `NUM_GAPS_LEN_TWO_IN_ALL_EXCEPT_ONE` (MSACompare: `GAPS_LENGTH_TWO_EXCEPT_ONE`):
  `|{u: u.length==2 and u.count==N-1}|`

Length 3:
- `NUM_GAPS_LEN_THREE_IN_ONE_SEQ` (MSACompare: `GAPS_LENGTH_THREE_ONE_SEQ`): `|{u: u.length==3 and u.count==1}|`
- `NUM_GAPS_LEN_THREE_IN_TWO_SEQS` (MSACompare: `GAPS_LENGTH_THREE_TWO_SEQ`): `|{u: u.length==3 and u.count==2}|`
- `NUM_GAPS_LEN_THREE_IN_ALL_EXCEPT_ONE` (MSACompare: `GAPS_LENGTH_THREE_EXCEPT_ONE`):
  `|{u: u.length==3 and u.count==N-1}|`

Length ≥ 4:
- `NUM_GAPS_LEN_AT_LEAST_FOUR_IN_ONE_SEQ` (MSACompare: `GAPS_LARGER_THAN_THREE_ONE_SEQ`):
  `|{u: u.length>=4 and u.count==1}|`
- `NUM_GAPS_LEN_AT_LEAST_FOUR_IN_TWO_SEQS` (MSACompare: `GAPS_LARGER_THAN_THREE_TWO_SEQ`):
  `|{u: u.length>=4 and u.count==2}|`
- `NUM_GAPS_LEN_AT_LEAST_FOUR_IN_ALL_EXCEPT_ONE` (MSACompare: `GAPS_LARGER_THAN_THREE_EXCEPT_ONE`):
  `|{u: u.length>=4 and u.count==N-1}|`

Important edge-case note: to match the C++ reference, the "`== 1` / `== 2` / `== N-1`" checks are independent.
So for small `N`, some categories overlap:
- If `N == 2`, then `N-1 == 1`, so "`IN_ONE_SEQ`" and "`IN_ALL_EXCEPT_ONE`" count the same intervals.
- If `N == 3`, then `N-1 == 2`, so "`IN_TWO_SEQS`" and "`IN_ALL_EXCEPT_ONE`" count the same intervals.

#### Column-wise gap counts (computed on the original alignment, before all-gap trimming)

These count alignment columns based on how many sequences have a `-` in that column:

- `MSA_POSITION_WITH_0_GAPS` (MSACompare: `NO_GAP_COLUMNS`): number of columns with exactly 0 gaps
- `MSA_POSITION_WITH_1_GAPS` (MSACompare: `ONE_GAP_COLUMNS`): number of columns with exactly 1 gap
- `MSA_POSITION_WITH_2_GAPS` (MSACompare: `TWO_GAP_COLUMNS`): number of columns with exactly 2 gaps
- `MSA_POSITION_WITH_N_MINUS_1_GAPS` (MSACompare: `ONE_GAP_EXCEPT_ONE_COLUMN`): number of columns with exactly `N-1` gaps

Notes:
- Columns with `N` gaps (all-gap columns) are **not** counted in any of these buckets.
- For `N == 3`, the `N-1` bucket is effectively 0 because columns with 2 gaps are already counted in
  `MSA_POSITION_WITH_2_GAPS` (this matches the C++ reference behavior).

### Examples

Helper to get a readable dict:

```python
import msastats

def stats_dict(msa):
    return dict(zip(msastats.stats_names(), msastats.calculate_msa_stats(msa)))
```

#### Example 1: One gap interval shared by 2 of 3 sequences

```python
msa = ["A--A", "A--A", "AAAA"]  # N=3, L=4
stats = stats_dict(msa)

assert stats["MSA_LEN"] == 4.0
assert stats["LONGEST_UNALIGNED_SEQ"] == 4.0
assert stats["SHORTEST_UNALIGNED_SEQ"] == 2.0

# One unique gap interval of length 2, appearing in 2 sequences:
assert stats["TOT_NUM_UNIQUE_GAPS"] == 1.0
assert stats["TOT_NUM_GAPS"] == 2.0
assert stats["AVG_GAP_SIZE"] == 2.0
assert stats["NUM_GAPS_LEN_TWO"] == 2.0

# Column-wise gap counts (original alignment):
assert stats["MSA_POSITION_WITH_0_GAPS"] == 2.0
assert stats["MSA_POSITION_WITH_2_GAPS"] == 2.0
```

#### Example 2: All-gap columns are ignored for gap-run detection

```python
msa = ["A-A", "A-A", "A-A"]  # middle column is all gaps
stats = stats_dict(msa)

assert stats["MSA_LEN"] == 3.0

# The all-gap column is removed before gap runs are detected, so there are no gaps:
assert stats["TOT_NUM_GAPS"] == 0.0
assert stats["TOT_NUM_UNIQUE_GAPS"] == 0.0
assert stats["AVG_GAP_SIZE"] == 0.0

# But column-wise counts still “see” the original alignment columns:
assert stats["MSA_POSITION_WITH_0_GAPS"] == 2.0
```

#### Example 3: “total gap runs” vs “unique gap intervals”

```python
msa = ["A-A", "A-A", "AAA", "AAA"]  # one gap interval shared by 2 sequences
stats = stats_dict(msa)

# Total occurrences (one per sequence that has it):
assert stats["NUM_GAPS_LEN_ONE"] == 2.0

# Unique intervals are counted once:
assert stats["TOT_NUM_UNIQUE_GAPS"] == 1.0
assert stats["NUM_GAPS_LEN_ONE_IN_TWO_SEQS"] == 1.0
```
