Metadata-Version: 2.4
Name: punmunch
Version: 1.0.0
Summary: High-performance Python implementation of hunspell's unmunch tool for morphological word expansion
Author-email: Vlatko Kosturjak <vlatko.kosturjak@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/kost/punmunch
Project-URL: Repository, https://github.com/kost/punmunch.git
Project-URL: Issues, https://github.com/kost/punmunch/issues
Keywords: hunspell,morphology,linguistics,nlp,spell-checking,word-expansion
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: regex>=2022.1.18
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"
Requires-Dist: black>=22.0; extra == "dev"
Requires-Dist: isort>=5.0; extra == "dev"
Requires-Dist: mypy>=1.0; extra == "dev"
Requires-Dist: flake8>=6.0; extra == "dev"
Dynamic: license-file

# Punmunch

A high-performance Python implementation of hunspell's `unmunch` tool: it expands dictionary root words into their surface forms using morphological affix rules.

## Features

- **Fast morphological expansion** - Expand dictionary root words to all possible surface forms
- **Advanced expansion modes** - Conservative, balanced, and deep expansion with configurable recursion
- **Intermediate flag processing** - Handles complex rules like "word0/yoc" with embedded flags
- **Hunspell compatibility** - Supports standard hunspell .aff and .dic file formats
- **Multiple flag types** - Handles char, long, and numeric flag formats
- **Command-line tool** - Easy-to-use CLI with `--expand` mode and advanced options
- **Python library** - Clean API for programmatic use in applications
- **Type hints** - Full type annotation support for better development experience
- **Comprehensive error handling** - Clear error messages for debugging affix files
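
To make "morphological expansion" concrete, here is a minimal sketch of how a hunspell-style suffix rule turns a root into a surface form. It is not punmunch's implementation: `SuffixRule`, `apply_suffix_rules`, and the toy English-like rules are hypothetical names and data chosen for illustration, and the sketch assumes single-character flags and ignores prefixes and cross products.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SuffixRule:
    """One hunspell-style SFX entry (hypothetical helper, not punmunch's API)."""
    flag: str       # flag that enables the rule on a dictionary entry
    strip: str      # characters removed from the end of the root ("" for none)
    add: str        # characters appended after stripping
    condition: str  # regex fragment that must match the end of the root


def apply_suffix_rules(root: str, flags: str, rules: List[SuffixRule]) -> List[str]:
    """Return the root plus every form produced by rules enabled by `flags`."""
    forms = [root]
    for rule in rules:
        if rule.flag not in flags:  # assumes single-character flags
            continue
        if not re.search(rule.condition + "$", root):
            continue
        if rule.strip and not root.endswith(rule.strip):
            continue
        stem = root[: len(root) - len(rule.strip)]
        forms.append(stem + rule.add)
    return forms


# Toy rules, as they might appear in an .aff file:
#   SFX D y ied [^aeiou]y
#   SFX D 0 ed  [^ey]
#   SFX G 0 ing [^e]
RULES = [
    SuffixRule("D", "y", "ied", "[^aeiou]y"),
    SuffixRule("D", "", "ed", "[^ey]"),
    SuffixRule("G", "", "ing", "[^e]"),
]

print(apply_suffix_rules("try", "D", RULES))
print(apply_suffix_rules("walk", "DG", RULES))
```

A dictionary entry such as `walk/DG` carries the flags `D` and `G`, so both the `ed` and `ing` rules apply to it.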

## Installation

```bash
pip install punmunch
```

## Command Line Usage

### Full Dictionary Expansion (Unmunch)

Expand all root words in a dictionary to their surface forms:

```bash
punmunch de.aff de.dic > expanded_german.txt
```

### Interactive Word Expansion (`--expand` mode)

Expand specific words by reading from stdin:

```bash
# Expand specific German words
echo -e "Haus\nAuto" | punmunch --expand de.aff

# Expand with dictionary validation
echo -e "Haus\nAuto" | punmunch --expand de.aff de.dic --validate

# Deep expansion for maximum morphological coverage
echo -e "Haus\nAuto" | punmunch --expand --deep de.aff

# Balanced expansion (good performance/completeness trade-off)
echo -e "Haus\nAuto" | punmunch --expand --balanced de.aff

# Expand Croatian words
echo -e "kuća\nkava" | punmunch --expand hunspell-hr/hr_HR.aff hunspell-hr/hr_HR.dic
```

### Advanced Expansion Modes

```bash
# Deep expansion for maximum morphological coverage
punmunch --deep de.aff de.dic > deep_expanded.txt

# Balanced expansion (good performance/completeness trade-off)
punmunch --balanced de.aff de.dic > balanced_expanded.txt

# Conservative expansion (default, fastest)
punmunch de.aff de.dic > conservative_expanded.txt
```
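
The modes differ in how far expansion is allowed to recurse. As a rough illustration of depth-limited expansion (a sketch under assumptions, not the library's algorithm), the snippet below models derived forms that carry continuation flags of their own. The `RULES` table and the `expand` function are hypothetical, and no claim is made about which depth each mode actually uses.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical rule table: flag -> list of (suffix, continuation flags).
# Continuation flags model derived forms that enable further rules.
RULES: Dict[str, List[Tuple[str, str]]] = {
    "A": [("er", "B")],
    "B": [("s", "C")],
    "C": [("'", "")],
}


def expand(word: str, flags: str, max_depth: int) -> Set[str]:
    """Collect forms reachable within `max_depth` rounds of rule application."""
    forms = {word}
    if max_depth == 0:
        return forms
    for flag in flags:
        for suffix, next_flags in RULES.get(flag, []):
            # Each derived form is expanded again with its own flags,
            # using one less level of the recursion budget.
            forms |= expand(word + suffix, next_flags, max_depth - 1)
    return forms


for depth in (1, 2, 3):
    print(depth, sorted(expand("build", "A", depth)))
```

A larger depth budget reaches more derived forms at the cost of more rule applications, which is the trade-off the modes above expose.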

### Show Statistics

```bash
punmunch --stats de.aff de.dic
```

## Python Library Usage

### Basic Usage

```python
from punmunch import Punmunch

# Initialize
punmunch = Punmunch()

# Load affix and dictionary files
punmunch.load_affix_file("de.aff")
punmunch.load_dictionary("de.dic")

# Expand entire dictionary
all_forms = punmunch.unmunch_all()
print(f"Generated {len(all_forms)} word forms")

# Expand specific words
words = punmunch.expand_words(["Haus", "Auto"])
print("Expanded forms:", words)
```

### Advanced Expansion Options

```python
from punmunch import Punmunch, ExpansionOptions

# Conservative expansion (default, fastest)
punmunch_conservative = Punmunch(ExpansionOptions.conservative())

# Deep expansion with maximum morphological coverage
punmunch_deep = Punmunch(ExpansionOptions.deep())
punmunch_deep.load_affix_file("de.aff")
punmunch_deep.load_dictionary("de.dic")
deep_forms = punmunch_deep.unmunch_all_deep()

# Balanced expansion (good trade-off)
punmunch_balanced = Punmunch(ExpansionOptions.balanced())
punmunch_balanced.load_affix_file("de.aff")
punmunch_balanced.load_dictionary("de.dic")
balanced_forms = punmunch_balanced.unmunch_all_balanced()

# Custom expansion options
custom_options = ExpansionOptions(
    max_recursion_depth=3,
    enable_deep_recursion=True,
    process_intermediate_flags=True,
    extensive_cross_products=False,
    max_forms_per_word=15000
)
punmunch_custom = Punmunch(custom_options)
```

### Advanced Usage

```python
from punmunch import Punmunch

# Load the affix rules and the dictionary
punmunch = Punmunch()
punmunch.load_affix_file("de.aff")
punmunch.load_dictionary("de.dic")

# Expand without dictionary validation (like --expand mode)
expanded = punmunch.expand_words(["Haus"], use_dictionary=False)

# Check if a word is valid
is_valid = punmunch.is_valid_word("Häuser")

# Find root words for a surface form
roots = punmunch.get_root_words("Häuser")

# Get comprehensive statistics including expansion settings
stats = punmunch.get_statistics()
print(f"Dictionary has {stats['word_count']} root words")
print(f"Expansion mode: {stats['expansion_mode']}")
print(f"Max recursion depth: {stats['max_recursion_depth']}")
print(f"Process intermediate flags: {stats['process_intermediate_flags']}")
```

### Working with Individual Components

```python
from punmunch import AffixFile, Dictionary, WordExpander, ExpansionOptions

# Load affix file
affix_file = AffixFile.load("de.aff")
print(f"Loaded {len(affix_file.get_all_flags())} flags")

# Load dictionary
dictionary = Dictionary.load("de.dic", affix_file.flag_type)
print(f"Loaded {dictionary.get_word_count()} words")

# Create expander with custom options
deep_options = ExpansionOptions.deep()
expander = WordExpander(affix_file, dictionary, deep_options)

# Expand specific word with its flags
word = "Haus"
flags = dictionary.get_flags(word)
expanded = expander.expand_word(word, flags)
print(f"'{word}' expands to: {expanded}")

# Use simple expansion (without dictionary flags)
simple_expanded = expander.expand_word_simple(word)
print(f"'{word}' simple expansion: {simple_expanded}")
```
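
The dictionary loader needs the affix file's flag type because the same flag string splits differently in each format. The sketch below shows the idea for the three formats hunspell defines (single characters, two-character "long" flags, and comma-separated numbers). `split_flags`, `parse_dic_line`, and the string values `"char"`, `"long"`, and `"num"` are assumptions made for illustration; the actual representation of `affix_file.flag_type` may differ.

```python
from typing import List, Tuple


def split_flags(raw: str, flag_type: str = "char") -> List[str]:
    """Split the flag field of a .dic entry according to the flag format."""
    if flag_type == "char":
        return list(raw)  # one flag per character
    if flag_type == "long":
        if len(raw) % 2:
            raise ValueError(f"long flags need an even length: {raw!r}")
        return [raw[i:i + 2] for i in range(0, len(raw), 2)]
    if flag_type == "num":
        return [part for part in raw.split(",") if part]
    raise ValueError(f"unknown flag type: {flag_type!r}")


def parse_dic_line(line: str, flag_type: str = "char") -> Tuple[str, List[str]]:
    """Parse 'word/flags' into (word, [flags]); morphological fields are ignored."""
    word, _, raw = line.strip().partition("/")
    return word, split_flags(raw, flag_type)


print(parse_dic_line("Haus/EPT"))
print(parse_dic_line("Haus/AaBb", "long"))
print(parse_dic_line("Haus/101,2045", "num"))
```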

## Performance

Based on benchmarks with test dictionaries (September 2024):

### Dictionary Unmunching (Full Expansion)
- **German**: 11,193 words/second, 6.90x expansion ratio (75,888 → 523,296 words)
- **Croatian**: 10,040 words/second, 21.45x expansion ratio (53,712 → 1,151,986 words)

### Word Expansion (`--expand` mode)
- **German (basic)**: 56.8 words/second (710 forms from 10 words)
- **German (deep)**: 46.5 words/second (695 forms from 10 words, +22% overhead)
- **Croatian (basic)**: 33.0 words/second (1,996 forms from 10 words)
- **Croatian (deep)**: 27.8 words/second (1,996 forms from 10 words)

### Expansion Mode Performance
- **Conservative**: Fastest, basic morphological expansion
- **Balanced**: ~5% performance impact, better coverage
- **Deep**: ~22% slower for --expand mode, maximum morphological coverage

Performance scales well with dictionary size and morphological complexity. Deep expansion provides more comprehensive results with moderate performance cost.
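
The derived figures can be rechecked from the raw numbers in this section. The short sketch below does so, treating the expansion ratio as output forms divided by root words and the deep-mode overhead as the extra time per word implied by the lower throughput; both definitions are assumptions about how the figures were computed, and the helper names are illustrative.

```python
def expansion_ratio(roots: int, forms: int) -> float:
    """Output forms generated per dictionary root word."""
    return forms / roots


def time_overhead(base_rate: float, slower_rate: float) -> float:
    """Fractional extra time per word when throughput drops to `slower_rate`."""
    return base_rate / slower_rate - 1.0


print(f"German ratio:         {expansion_ratio(75_888, 523_296):.2f}x")
print(f"Croatian ratio:       {expansion_ratio(53_712, 1_151_986):.2f}x")
print(f"German deep overhead: {time_overhead(56.8, 46.5):.0%}")
```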

## Requirements

- Python 3.8+
- `regex` library for advanced pattern matching

## Development

```bash
# Clone repository
git clone https://github.com/kost/punmunch.git
cd punmunch

# Install in development mode
pip install -e ".[dev]"

# Run tests
pytest

# Run type checking
mypy punmunch

# Format code
black punmunch
isort punmunch
```

## Testing Data

The package includes support for testing with:
- German dictionary (`de.aff`, `de.dic`) - 75,888 words
- Croatian dictionary (`hunspell-hr/hr_HR.aff`, `hunspell-hr/hr_HR.dic`) - 53,712 words

## License

MIT License - permissive open source license allowing commercial and private use.

## Contributing

Contributions are welcome!

## Related Projects

- [hunspell](https://hunspell.github.io/) - The original spell checker
- [runmunch](https://github.com/kost/runmunch) - Rust implementation
- [Lingua::Spelling::Alternative](https://metacpan.org/pod/Lingua::Spelling::Alternative) - Perl implementation
