Metadata-Version: 2.4
Name: distfeat
Version: 0.5.0
Summary: Standalone phonological feature systems for historical linguistics
Author-email: Tiago Tresoldi <tiago.tresoldi@lingfil.uu.se>
License-Expression: MIT
Project-URL: Homepage, https://github.com/tresoldi/distfeat
Project-URL: Documentation, https://github.com/tresoldi/distfeat#readme
Project-URL: Repository, https://github.com/tresoldi/distfeat
Project-URL: Bug Tracker, https://github.com/tresoldi/distfeat/issues
Keywords: phonology,phonological features,historical linguistics,sequence alignment,cognate detection
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.12
Description-Content-Type: text/markdown
License-File: LICENSE
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Requires-Dist: ruff; extra == "dev"
Requires-Dist: mypy; extra == "dev"
Requires-Dist: build; extra == "dev"
Dynamic: license-file

# distfeat

`distfeat` is a standalone Python package for manipulating phonological
features.

It provides:

- bundled phonological feature datasets
- pluggable feature systems
- feature geometry and distance functions
- query and analysis helpers for graphemes and feature sets

`distfeat` is dependency-free at runtime and is the standalone home for the
feature subsystem extracted from `alteruphono`.

The canonical modern API is built around native representations:

- use `get_representation(...)` when you want the system's native feature model
- use `matches(...)` and `segment_distance(...)` for system-native comparison
- treat `get_features(...)`, `partial_match(...)`, and `sound_distance(...)` as
  convenience helpers for categorical systems

## Installation

Install from PyPI:

```bash
pip install distfeat
```

Requires Python 3.12+.

Development install:

```bash
git clone https://github.com/tresoldi/distfeat.git
cd distfeat
uv venv
uv pip install -e ".[dev]"
```

Run checks in the project environment:

```bash
uv run ruff check .
uv run mypy src
uv run pytest -q
uv run python scripts/verify_examples.py
```

## Core Concepts

The package is organized around:

- a bundled `FeatureDataset`
- a lazy default registry plus explicit `Registry` instances
- built-in systems:
  - `ipa`
  - `tresoldi`
  - `distinctive`
  - `pbase-hc`
  - `pbase-jfh`
  - `pbase-spe`
  - `pbase-uftc`

The package does not define a `Sound` object. It works directly with graphemes,
feature bundles, native multi-state feature tables, scalar dimensions, and
matrices.

## Quick Start

```python
import distfeat

# Built-in systems
print(distfeat.list_systems())
# ['ipa', 'tresoldi', 'distinctive', 'pbase-hc', 'pbase-jfh', 'pbase-spe', 'pbase-uftc']

# Basic grapheme lookup
print(distfeat.get_features("p"))
# frozenset({'consonant', 'voiceless', 'bilabial', 'stop'})

# Predefined sound classes
print(distfeat.get_class_features("V"))
# frozenset({'vowel'})

# Direct grapheme distance
print(distfeat.distance("a", "e"))
```

## Working With Systems

You can use the lazy default registry through top-level helpers, or you can
work with a specific system object.

```python
import distfeat

ipa = distfeat.get_system("ipa")
tresoldi = distfeat.get_system("tresoldi")
distinctive = distfeat.get_system("distinctive")
pbase = distfeat.get_system("pbase-hc")

print(ipa.grapheme_to_features("a"))
print(tresoldi.grapheme_to_features("a"))
print(distinctive.grapheme_to_features("a"))
print(pbase.grapheme_to_representation("a"))
```

Exact reverse lookup is available when a native representation maps directly to
a known grapheme. For categorical systems this is usually a `frozenset[str]`;
for valued systems it can be a `dict[str, FeatureState | str]` or
`ValuedFeatures`.

```python
import distfeat

ipa = distfeat.get_system("ipa")

# Look up the grapheme whose feature bundle is exactly this set
grapheme = ipa.features_to_grapheme(
    frozenset({"consonant", "voiced", "bilabial", "stop"})
)
print(grapheme)
# 'b'
```
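
The idea behind exact reverse lookup can be illustrated without the package. The following is a minimal sketch, assuming a categorical system stores one `frozenset` per grapheme; `INVENTORY`, `REVERSE`, and `exact_reverse_lookup` are hypothetical names for illustration and are not part of the `distfeat` API.

```python
from __future__ import annotations

# Hypothetical toy inventory; the real data ships with the package.
INVENTORY: dict[str, frozenset[str]] = {
    "p": frozenset({"consonant", "voiceless", "bilabial", "stop"}),
    "b": frozenset({"consonant", "voiced", "bilabial", "stop"}),
}

# Invert the mapping once; frozensets are hashable, so they can be dict keys.
REVERSE: dict[frozenset[str], str] = {feats: g for g, feats in INVENTORY.items()}


def exact_reverse_lookup(features: frozenset[str]) -> str | None:
    """Return the grapheme whose bundle equals `features`, or None."""
    return REVERSE.get(features)


print(exact_reverse_lookup(frozenset({"consonant", "voiced", "bilabial", "stop"})))
# b
print(exact_reverse_lookup(frozenset({"consonant", "voiced"})))
# None
```

An underspecified bundle returns `None` in this sketch because no grapheme maps to it exactly; partial queries are a separate operation.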

## Feature Queries

### Find Graphemes Matching a Feature Set

Use `features_to_graphemes(...)` to retrieve all graphemes satisfying a
feature query.

By default, matching is partial and uses the semantics of the selected system.

```python
import distfeat

# All vowels in the default system
vowels = distfeat.features_to_graphemes(frozenset({"vowel"}))
print(vowels[:10])

# Voiceless consonants
voiceless_consonants = distfeat.features_to_graphemes(
    frozenset({"consonant", "-voiced"})
)
print(voiceless_consonants[:10])
```

You can also force exact matching:

```python
import distfeat

ipa = distfeat.get_system("ipa")
features = ipa.grapheme_to_features("a")
print(distfeat.features_to_graphemes(features, exact=True))
```
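
The difference between partial and exact matching can be sketched with plain sets. This example assumes that partial matching on a categorical system means "every queried feature is present", which is one plausible reading and may differ from the semantics a given system actually applies; `INVENTORY` and `find_graphemes` are hypothetical.

```python
from __future__ import annotations

# Hypothetical toy inventory, not the bundled dataset.
INVENTORY: dict[str, frozenset[str]] = {
    "a": frozenset({"vowel", "open", "front"}),
    "p": frozenset({"consonant", "voiceless", "bilabial", "stop"}),
    "b": frozenset({"consonant", "voiced", "bilabial", "stop"}),
}


def find_graphemes(query: frozenset[str], exact: bool = False) -> list[str]:
    """Return graphemes whose bundle contains (or, if exact, equals) `query`."""
    if exact:
        return sorted(g for g, feats in INVENTORY.items() if feats == query)
    # Assumed partial semantics: every queried feature must be present.
    return sorted(g for g, feats in INVENTORY.items() if query <= feats)


query = frozenset({"consonant", "bilabial"})
print(find_graphemes(query))
# ['b', 'p']
print(find_graphemes(query, exact=True))
# []
```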

## Native Multi-State Systems

`distfeat` also supports systems whose native representation is a named
feature-value table instead of a categorical set. The bundled P-base-derived
systems expose multi-state values such as `+`, `-`, `n`, `.`, `o`, and `x`
through `FeatureState`.

```python
import distfeat

rep = distfeat.get_representation("a", system="pbase-hc")
print(rep.values["syllabic"])
# FeatureState.POSITIVE

matches = distfeat.features_to_graphemes({"syllabic": "+"}, system="pbase-hc")
print(matches[:10])
```

The bundled P-base table is a derived table, not a verbatim copy of the source.
The source data contains duplicate IPA rows, and a small number of those
duplicates carry conflicting values in some columns. `distfeat` merges
duplicate rows conservatively:

- identical duplicate rows collapse into one row
- if duplicate rows disagree, only the conflicting cells are downgraded to `.`
  (`FeatureState.DOT`)

This preserves a single usable row per grapheme without inventing new positive
or negative values where the source disagrees.
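
The merge rule can be sketched in a few lines. This is an illustration of the rule as described above, not the package's loader: it assumes all duplicate rows share the same columns, uses plain strings for states, and the feature names in the usage example are made up.

```python
from __future__ import annotations

DOT = "."  # the symbol the text above associates with FeatureState.DOT


def merge_duplicate_rows(rows: list[dict[str, str]]) -> dict[str, str]:
    """Merge duplicate rows for one grapheme; disagreeing cells become DOT.

    Assumes at least one row and identical columns in every row.
    """
    merged = dict(rows[0])
    for row in rows[1:]:
        for feature, value in row.items():
            if merged[feature] != value:
                # Conflict: keep neither value, downgrade this cell only.
                merged[feature] = DOT
    return merged


duplicates = [
    {"voice": "+", "nasal": "-", "high": "+"},
    {"voice": "+", "nasal": "-", "high": "-"},
]
print(merge_duplicate_rows(duplicates))
# {'voice': '+', 'nasal': '-', 'high': '.'}
```

Identical rows pass through unchanged, since no cell ever disagrees.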

### Derive Shared Class Features

Use `derive_class_features(...)` to compute the strict shared feature
intersection of a set of graphemes.

```python
import distfeat

print(distfeat.derive_class_features(["t", "d"]))
# frozenset({'consonant', 'alveolar', 'stop', ...})

print(distfeat.derive_class_features(["t", "d", "s"]))
# fewer shared features than the pair above
```

For multi-state systems, the result is a dictionary of shared feature states:

```python
import distfeat

print(distfeat.derive_class_features(["t", "d"], system="pbase-hc"))
# {'consonantal': <FeatureState.POSITIVE: '+'>, ...}
```
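
A strict shared intersection is straightforward to express for both kinds of representation. The sketch below uses hypothetical helper names and toy feature values; it assumes categorical bundles are sets and multi-state rows are dictionaries from feature name to state symbol.

```python
from __future__ import annotations


def shared_categorical(bundles: list[frozenset[str]]) -> frozenset[str]:
    """Features present in every bundle (expects a non-empty list)."""
    return frozenset.intersection(*bundles)


def shared_valued(tables: list[dict[str, str]]) -> dict[str, str]:
    """Feature/value pairs on which every table agrees."""
    first, *rest = tables
    return {
        name: value
        for name, value in first.items()
        if all(other.get(name) == value for other in rest)
    }


# Toy bundles for illustration only.
T = frozenset({"consonant", "voiceless", "alveolar", "stop"})
D = frozenset({"consonant", "voiced", "alveolar", "stop"})
S = frozenset({"consonant", "voiceless", "alveolar", "fricative"})

print(sorted(shared_categorical([T, D])))
# ['alveolar', 'consonant', 'stop']
print(sorted(shared_categorical([T, D, S])))
# ['alveolar', 'consonant']
print(shared_valued([{"consonantal": "+", "voice": "-"}, {"consonantal": "+", "voice": "+"}]))
# {'consonantal': '+'}
```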

## Minimal Distinguishing Matrices

Use `minimal_matrix(...)` to compute the smallest feature set needed to
distinguish a given list of graphemes.

```python
import distfeat

matrix = distfeat.minimal_matrix(["t", "d"], system="ipa")
print(matrix.columns)
print(matrix.rows)
```

For `ipa` and `tresoldi`, the matrix is categorical and boolean. For
`distinctive`, it uses scalar dimensions. For P-base-derived systems, it uses
native multi-state values.

```python
import distfeat

matrix = distfeat.minimal_matrix(["t", "d", "s"], system="ipa")
print(distfeat.tabulate_matrix(matrix))
```

Example plain-text output:

```text
grapheme | continuant | voiced
---------+------------+-------
t        | False      | False
d        | False      | True
s        | True       | False
```

Markdown output is also supported:

```python
# Reuses `matrix` from the previous example
print(distfeat.tabulate_matrix(matrix, format="markdown"))
```

P-base-derived systems render symbolic state values directly:

```python
import distfeat

matrix = distfeat.minimal_matrix(["t", "d"], system="pbase-hc")
print(distfeat.tabulate_matrix(matrix))
```
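
To make the notion of a minimal distinguishing feature set concrete, here is a brute-force sketch over a hypothetical boolean table chosen to mirror the plain-text example output above. How `minimal_matrix(...)` searches internally is not described here, so the exhaustive search below is only an assumption for illustration; it is exponential in the number of features and suitable only for small tables.

```python
from __future__ import annotations

from itertools import combinations

# Hypothetical boolean feature table for three graphemes.
TABLE: dict[str, dict[str, bool]] = {
    "t": {"consonant": True, "continuant": False, "voiced": False},
    "d": {"consonant": True, "continuant": False, "voiced": True},
    "s": {"consonant": True, "continuant": True, "voiced": False},
}


def minimal_features(graphemes: list[str]) -> tuple[str, ...]:
    """Return the smallest feature subset giving each grapheme a unique row."""
    features = sorted(TABLE[graphemes[0]])
    # Try subsets in order of increasing size, so the first hit is minimal.
    for size in range(len(features) + 1):
        for subset in combinations(features, size):
            rows = {tuple(TABLE[g][f] for f in subset) for g in graphemes}
            if len(rows) == len(graphemes):
                return subset
    raise ValueError("graphemes cannot be distinguished by the available features")


print(minimal_features(["t", "d"]))
# ('voiced',)
print(minimal_features(["t", "d", "s"]))
# ('continuant', 'voiced')
```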

## Distinctive Scalars

The `distinctive` system also exposes scalar representations.

```python
from distfeat import DistinctiveFeatureSystem, load_builtin_dataset

system = DistinctiveFeatureSystem(dataset=load_builtin_dataset())

print(system.grapheme_to_scalars("a"))
print(system.features_to_scalars(system.grapheme_to_features("a")))
print(system.scalars_to_features({"voice": 1.0, "labial": 1.0}))
```

## Distance

### System-Based Distance

The default `distance(...)` helper resolves graphemes through the selected
system and uses that system's native distance.

```python
import distfeat

print(distfeat.distance("a", "e"))
print(distfeat.distance("a", "u"))
print(distfeat.distance("p", "b"))
print(distfeat.distance("t", "d", system="pbase-hc"))
```
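
The resolve-then-delegate pattern described above can be sketched as follows. Everything in this block is hypothetical: `ToyCategoricalSystem`, `native_distance`, and `toy_distance` are illustrative names, and Jaccard distance is used only as a placeholder metric, not as a claim about how any built-in system measures distance.

```python
from __future__ import annotations


class ToyCategoricalSystem:
    """Hypothetical stand-in for a categorical feature system."""

    def __init__(self, inventory: dict[str, frozenset[str]]) -> None:
        self.inventory = inventory

    def native_distance(self, a: frozenset[str], b: frozenset[str]) -> float:
        # Jaccard distance, used here only as a placeholder metric.
        union = a | b
        return 1.0 - len(a & b) / len(union) if union else 0.0


def toy_distance(g1: str, g2: str, system: ToyCategoricalSystem) -> float:
    """Resolve both graphemes, then delegate to the system's own metric."""
    return system.native_distance(system.inventory[g1], system.inventory[g2])


toy = ToyCategoricalSystem(
    {
        "p": frozenset({"consonant", "voiceless", "bilabial", "stop"}),
        "b": frozenset({"consonant", "voiced", "bilabial", "stop"}),
    }
)
print(toy_distance("p", "p", toy))
print(toy_distance("p", "b", toy))
```

Keeping the metric on the system object is what lets the same helper serve categorical, scalar, and multi-state systems.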

### Precomputed Distance Matrices

You can also supply a precomputed nested dictionary.

```python
import distfeat

precomputed = {
    "a": {"e": 1.5, "u": 2.0},
    "p": {"b": 0.5},
}

print(distfeat.distance("a", "e", precomputed=precomputed))
print(distfeat.distance("b", "p", precomputed=precomputed))
```

If a requested pair is missing from the precomputed matrix, the function raises
`KeyError`.
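
A nested-dictionary lookup with this failure mode might look like the sketch below. The example above queries `("b", "p")` while only `"p" -> "b"` is stored, which suggests that both orientations are checked; that symmetric behaviour is an assumption here, and `lookup_precomputed` is a hypothetical helper rather than the package's implementation.

```python
from __future__ import annotations


def lookup_precomputed(a: str, b: str, table: dict[str, dict[str, float]]) -> float:
    """Look up a pair in either orientation; raise KeyError if absent."""
    for first, second in ((a, b), (b, a)):
        if second in table.get(first, {}):
            return table[first][second]
    raise KeyError((a, b))


precomputed = {
    "a": {"e": 1.5, "u": 2.0},
    "p": {"b": 0.5},
}

print(lookup_precomputed("a", "e", precomputed))
# 1.5
print(lookup_precomputed("b", "p", precomputed))  # found via the reversed pair
# 0.5
```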

## Custom Datasets

### Load From a Directory

```python
from distfeat import create_registry, load_dataset

dataset = load_dataset(directory="my_feature_data")
registry = create_registry(dataset=dataset)
system = registry.get_system("ipa")

print(system.grapheme_to_features("k"))
```

Expected files in `my_feature_data/`:

- `sounds.tsv`
- `classes.tsv`
- `features.tsv`
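
Before loading, it can be convenient to confirm that the directory contains all three files. The helper below is a small standard-library sketch; `missing_dataset_files` is a hypothetical name, and the check covers only file presence, not the column layout of each file.

```python
from __future__ import annotations

from pathlib import Path

# File names taken from the list above.
EXPECTED_FILES = ("sounds.tsv", "classes.tsv", "features.tsv")


def missing_dataset_files(directory: str | Path) -> list[str]:
    """Return the expected file names that are absent from `directory`."""
    root = Path(directory)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


missing = missing_dataset_files("my_feature_data")
if missing:
    print("missing files:", ", ".join(missing))
```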

### Build From In-Memory Rows

```python
from distfeat import create_registry, dataset_from_rows
from distfeat.systems.ipa import IPAFeatureSystem

dataset = dataset_from_rows(
    sounds={"a": "open front vowel", "p": "voiceless bilabial consonant stop"},
    classes={"V": ("vowel", "vowel", ["a"])},
    features=[("open", "height"), ("front", "centrality"), ("stop", "manner")],
)

registry = create_registry(dataset=dataset, register_builtin=False)
registry.register("ipa", IPAFeatureSystem(dataset))

print(registry.get_system("ipa").grapheme_to_features("a"))
```

## Bundled P-base-Derived Data

`distfeat` bundles a derived segment table based on the P-base distribution.
The bundled systems are:

- `pbase-hc`
- `pbase-jfh`
- `pbase-spe`
- `pbase-uftc`

These systems use the same registry and analysis APIs as the categorical and
scalar systems, but operate on native multi-state feature values.

The P-base-derived data is bundled separately from the MIT-licensed code and
retains its own attribution and license notice in `src/distfeat/data/pbase/`.

## Explicit Registries

Use explicit registries when you want isolated state instead of the default
global registry.

```python
from distfeat import create_registry, load_builtin_dataset

registry = create_registry(dataset=load_builtin_dataset())
registry.set_default("tresoldi")

print(registry.get_system().name)
print(registry.list_systems())
```
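
The value of isolated state is easiest to see with two registries side by side. The class below is a hypothetical, stripped-down stand-in written for illustration; it borrows method names from the examples in this README but makes no claim about how `Registry` is implemented, and it stores plain strings where real systems would go.

```python
from __future__ import annotations


class ToyRegistry:
    """Hypothetical minimal registry; each instance owns its own state."""

    def __init__(self) -> None:
        self._systems: dict[str, object] = {}
        self._default: str | None = None

    def register(self, name: str, system: object) -> None:
        self._systems[name] = system
        if self._default is None:
            self._default = name  # first registered system becomes the default

    def set_default(self, name: str) -> None:
        if name not in self._systems:
            raise KeyError(name)
        self._default = name

    def get_system(self, name: str | None = None) -> object:
        key = self._default if name is None else name
        if key is None:
            raise LookupError("no systems registered")
        return self._systems[key]


first = ToyRegistry()
first.register("ipa", "IPA placeholder")
first.register("tresoldi", "Tresoldi placeholder")
first.set_default("tresoldi")

second = ToyRegistry()
second.register("ipa", "IPA placeholder")

# Changing the default on `first` leaves `second` untouched.
print(first.get_system())
# Tresoldi placeholder
print(second.get_system())
# IPA placeholder
```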

## What The Package Does Not Do

The current package intentionally does not provide:

- a legacy `DistFeat` facade class
- the old binary/tristate feature-table interface
- `grapheme2features(..., t_values=False)` style `+/-/0` rendering
- vector output modes for feature tables or matrices
- a command-line interface
- ML-based distance training

The current public API is built around categorical feature bundles, native
multi-state feature tables, scalar dimensions for the `distinctive` system,
and analysis helpers over those representations.

## Documentation

- [docs/index.md](docs/index.md) for the package overview
- [docs/api.md](docs/api.md) for the public API
- [docs/datasets.md](docs/datasets.md) for dataset loading
- [docs/systems.md](docs/systems.md) for built-in systems
- [docs/recipes.md](docs/recipes.md) for task-oriented workflows
- [docs/development.md](docs/development.md) for implementation constraints

## Relationship to alteruphono

`alteruphono` should be treated as a consumer of `distfeat`, not the owner of
the feature subsystem.
