Metadata-Version: 2.4
Name: furlanspellchecker
Version: 0.1.1
Summary: A comprehensive spell checker for the Friulian language with CLI and pipeline service.
Project-URL: Homepage, https://github.com/daurmax/FurlanSpellChecker
Project-URL: Repository, https://github.com/daurmax/FurlanSpellChecker
Project-URL: Issues, https://github.com/daurmax/FurlanSpellChecker/issues
Project-URL: Documentation, https://github.com/daurmax/FurlanSpellChecker/blob/main/README.md
Author-email: Massimo Romanin <dev@massimoromanin.com>
License: Attribution-NonCommercial 4.0 International
        
        =======================================================================
        
        Creative Commons Corporation ("Creative Commons") is not a law firm and
        does not provide legal services or legal advice. Distribution of
        Creative Commons public licenses does not create a lawyer-client or
        other relationship. Creative Commons makes its licenses and related
        information available on an "as-is" basis. Creative Commons gives no
        warranties regarding its licenses, any material licensed under their
        terms and conditions, or any related information. Creative Commons
        disclaims all liability for damages resulting from their use to the
        fullest extent possible.
        
        Using Creative Commons Public Licenses
        
        Creative Commons public licenses provide a standard set of terms and
        conditions that creators and other rights holders may use to share
        original works of authorship and other material subject to copyright
        and certain other rights specified in the public license below. The
        following considerations are for informational purposes only, are not
        exhaustive, and do not form part of our licenses.
        
             Considerations for licensors: Our public licenses are
             intended for use by those authorized to give the public
             permission to use material in ways otherwise restricted by
             copyright and certain other rights. Our licenses are
             irrevocable. Licensors should read and understand the terms
             and conditions of the license they choose before applying it.
             Licensors should also secure all rights necessary before
             applying our licenses so that the public can reuse the
             material as expected. Licensors should clearly mark any
             material not subject to the license. This includes other CC-
             licensed material, or material used under an exception or
             limitation to copyright. More considerations for licensors:
            wiki.creativecommons.org/Considerations_for_licensors
        
             Considerations for the public: By using one of our public
             licenses, a licensor grants the public permission to use the
             licensed material under specified terms and conditions. If
             the licensor's permission is not necessary for any reason--for
             example, because of any applicable exception or limitation to
             copyright--then that use is not regulated by the license. Our
             licenses grant only permissions under copyright and certain
             other rights that a licensor has authority to grant. Use of
             the licensed material may still be restricted for other
             reasons, including because others have copyright or other
             rights in the material. A licensor may make special requests,
             such as asking that all changes be marked or described.
             Although not required by our licenses, you are encouraged to
             respect those requests where reasonable. More considerations
             for the public:
            wiki.creativecommons.org/Considerations_for_licensees
        
        =======================================================================
        
        Creative Commons Attribution-NonCommercial 4.0 International Public
        License
        
        By exercising the Licensed Rights (defined below), You accept and agree
        to be bound by the terms and conditions of this Creative Commons
        Attribution-NonCommercial 4.0 International Public License ("Public
        License"). To the extent this Public License may be interpreted as a
        contract, You are granted the Licensed Rights in consideration of Your
        acceptance of these terms and conditions, and the Licensor grants You
        such rights in consideration of benefits the Licensor receives from
        making the Licensed Material available under these terms and
        conditions.
        
        Section 1 -- Definitions.
        
          a. Adapted Material means material subject to Copyright and Similar
             Rights that is derived from or based upon the Licensed Material
             and in which the Licensed Material is translated, altered,
             arranged, transformed, or otherwise modified in a manner requiring
             permission under the Copyright and Similar Rights held by the
             Licensor. For purposes of this Public License, where the Licensed
             Material is a musical work, performance, or sound recording,
             Adapted Material is always produced where the Licensed Material is
             synched in timed relation with a moving image.
        
          b. Adapter's License means the license You apply to Your Copyright
             and Similar Rights in Your contributions to Adapted Material in
             accordance with the terms and conditions of this Public License.
        
          c. Copyright and Similar Rights means copyright and/or similar rights
             closely related to copyright including, without limitation,
             performance, broadcast, sound recording, and Sui Generis Database
             Rights, without regard to how the rights are labeled or
             categorized. For purposes of this Public License, the rights
             specified in Section 2(b)(1)-(2) are not Copyright and Similar
             Rights.
          d. Effective Technological Measures means those measures that, in the
             absence of proper authority, may not be circumvented under laws
             fulfilling obligations under Article 11 of the WIPO Copyright
             Treaty adopted on December 20, 1996, and/or similar international
             agreements.
        
          e. Exceptions and Limitations means fair use, fair dealing, and/or
             any other exception or limitation to Copyright and Similar Rights
             that applies to Your use of the Licensed Material.
        
          f. Licensed Material means the artistic or literary work, database,
             or other material to which the Licensor applied this Public
             License.
        
          g. Licensed Rights means the rights granted to You subject to the
             terms and conditions of this Public License, which are limited to
             all Copyright and Similar Rights that apply to Your use of the
             Licensed Material and that the Licensor has authority to license.
        
          h. Licensor means the individual(s) or entity(ies) granting rights
             under this Public License.
        
          i. NonCommercial means not primarily intended for or directed towards
             commercial advantage or monetary compensation. For purposes of
             this Public License, the exchange of the Licensed Material for
             other material subject to Copyright and Similar Rights by digital
             file-sharing or similar means is NonCommercial provided there is
             no payment of monetary compensation in connection with the
             exchange.
        
          j. Share means to provide material to the public by any means or
             process that requires permission under the Licensed Rights, such
             as reproduction, public display, public performance, distribution,
             dissemination, communication, or importation, and to make material
             available to the public including in ways that members of the
             public may access the material from a place and at a time
             individually chosen by them.
        
          k. Sui Generis Database Rights means rights other than copyright
             resulting from Directive 96/9/EC of the European Parliament and of
             the Council of 11 March 1996 on the legal protection of databases,
             as amended and/or succeeded, as well as other essentially
             equivalent rights anywhere in the world.
        
          l. You means the individual or entity exercising the Licensed Rights
             under this Public License. Your has a corresponding meaning.
        
        
        Section 2 -- Scope.
        
          a. License grant.
        
               1. Subject to the terms and conditions of this Public License,
                  the Licensor hereby grants You a worldwide, royalty-free,
                  non-sublicensable, non-exclusive, irrevocable license to
                  exercise the Licensed Rights in the Licensed Material to:
        
                    a. reproduce and Share the Licensed Material, in whole or
                       in part, for NonCommercial purposes only; and
        
                    b. produce, reproduce, and Share Adapted Material for
                       NonCommercial purposes only.
        
               2. Exceptions and Limitations. For the avoidance of doubt, where
                  Exceptions and Limitations apply to Your use, this Public
                  License does not apply, and You do not need to comply with
                  its terms and conditions.
        
               3. Term. The term of this Public License is specified in Section
                  6(a).
        
               4. Media and formats; technical modifications allowed. The
                  Licensor authorizes You to exercise the Licensed Rights in
                  all media and formats whether now known or hereafter created,
                  and to make technical modifications necessary to do so. The
                  Licensor waives and/or agrees not to assert any right or
                  authority to forbid You from making technical modifications
                  necessary to exercise the Licensed Rights, including
                  technical modifications necessary to circumvent Effective
                  Technological Measures. For purposes of this Public License,
                  simply making modifications authorized by this Section 2(a)
                  (4) never produces Adapted Material.
        
               5. Downstream recipients.
        
                    a. Offer from the Licensor -- Licensed Material. Every
                       recipient of the Licensed Material automatically
                       receives an offer from the Licensor to exercise the
                       Licensed Rights under the terms and conditions of this
                       Public License.
        
                    b. No downstream restrictions. You may not offer or impose
                       any additional or different terms or conditions on, or
                       apply any Effective Technological Measures to, the
                       Licensed Material if doing so restricts exercise of the
                       Licensed Rights by any recipient of the Licensed
                       Material.
        
               6. No endorsement. Nothing in this Public License constitutes or
                  may be construed as permission to assert or imply that You
                  are, or that Your use of the Licensed Material is, connected
                  with, or sponsored, endorsed, or granted official status by,
                  the Licensor or others designated to receive attribution as
                  provided in Section 3(a)(1)(A)(i).
        
          b. Other rights.
        
               1. Moral rights, such as the right of integrity, are not
                  licensed under this Public License, nor are publicity,
                  privacy, and/or other similar personality rights; however, to
                  the extent possible, the Licensor waives and/or agrees not to
                  assert any such rights held by the Licensor to the limited
                  extent necessary to allow You to exercise the Licensed
                  Rights, but not otherwise.
        
               2. Patent and trademark rights are not licensed under this
                  Public License.
        
               3. To the extent possible, the Licensor waives any right to
                  collect royalties from You for the exercise of the Licensed
                  Rights, whether directly or through a collecting society
                  under any voluntary or waivable statutory or compulsory
                  licensing scheme. In all other cases the Licensor expressly
                  reserves any right to collect such royalties, including when
                  the Licensed Material is used other than for NonCommercial
                  purposes.
        
        ... (LICENSE content truncated here; full text copied from FurlanG2P LICENSE)
        MIT License
        
        Copyright (c) 2024 Massimo Romanin
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
License-File: LICENSE
Keywords: friulian,furlan,nlp,spellchecker,spelling,text-processing
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Education
Classifier: License :: Other/Proprietary License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.10
Requires-Dist: click>=8.1.7
Requires-Dist: colorama>=0.4.6
Requires-Dist: msgpack>=1.0.0
Requires-Dist: rapidfuzz>=3.0.0
Requires-Dist: requests>=2.31.0
Provides-Extra: api
Requires-Dist: fastapi>=0.104.0; extra == 'api'
Requires-Dist: uvicorn>=0.24.0; extra == 'api'
Provides-Extra: dev
Requires-Dist: black>=24.4.0; extra == 'dev'
Requires-Dist: hypothesis>=6.0.0; extra == 'dev'
Requires-Dist: mypy>=1.10.0; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.23.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.1.0; extra == 'dev'
Requires-Dist: pytest>=8.0.0; extra == 'dev'
Requires-Dist: ruff>=0.5.0; extra == 'dev'
Requires-Dist: types-click>=7.1.8; extra == 'dev'
Requires-Dist: types-colorama>=0.4.15; extra == 'dev'
Description-Content-Type: text/markdown

# FurlanSpellChecker

A comprehensive spell checker for the Friulian language with CLI and pipeline service.

## Overview

FurlanSpellChecker is a Python library and command-line tool for spell checking text in the Friulian (Furlan) language. It provides a complete spell checking pipeline with dictionary management, phonetic algorithms, and text processing capabilities specifically designed for Friulian linguistic features.

## Features

- **Complete spell checking pipeline** - Tokenization, spell checking, and correction suggestions
- **Friulian-specific phonetic algorithm** - Custom phonetic similarity for better suggestions
- **Flexible dictionary system** - Support for multiple dictionaries with RadixTree optimization
- **Command-line interface** - Easy-to-use CLI for batch processing and interactive use
- **Configurable processing** - Extensive configuration options for different use cases
- **Python API** - Full programmatic access to all functionality

## Installation

### From PyPI (when available)

```bash
pip install furlanspellchecker
```

### From source

```bash
git clone https://github.com/daurmax/FurlanSpellChecker.git
cd FurlanSpellChecker
pip install -e .
```

### Development installation

```bash
git clone https://github.com/daurmax/FurlanSpellChecker.git
cd FurlanSpellChecker
pip install -e ".[dev]"
```

## Quick Start

### Command Line Usage

#### Interactive Mode (New!)

Start the interactive REPL with colored output and multilingual support:

```bash
furlanspellchecker interactive
```

Features:
- **ASCII art logo** - Beautiful Friulian-themed startup banner
- **Colored output** - Easy-to-read colored console output (via colorama, installed as a core dependency)
- **Multilingual interface** - Choose between English, Friulian (Furlan), and Italian
- **Interactive commands**:
  - `C <words>...` - Check spelling of one or more words
  - `S <word>` - Get suggestions for a misspelled word
  - `Q` - Quit the application

Options:
```bash
# Specify language directly (skip selection prompt)
furlanspellchecker interactive --language fur  # Friulian
furlanspellchecker interactive --language it   # Italian
furlanspellchecker interactive --language en   # English

# Disable colored output
furlanspellchecker interactive --no-color
```

Example session:
```
> C preon lenghe
preon is correct
lenghe is correct

> S preo
preo is incorrect
Suggestions are: preon, pren, predi

> Q
Closing the application. Goodbye!
```

#### COF Protocol Mode (for automation)

For automation and testing compatibility with the Perl COF implementation:

```bash
# Read commands from stdin (printf expands "\n" portably, unlike echo -e)
printf 'c preon\ns sbaliât\nq\n' | furlanspellchecker cof-cli

# With options
furlanspellchecker cof-cli --encoding utf8 --max-suggestions 5
```

Protocol commands:
- `c <word> [<word2> ...]` - Check spelling (returns `ok\n` or `no\n`)
- `s <word>` - Get suggestions (returns `ok\n` or `no\t<sug1>,<sug2>,...\n`)
- `q` - Quit

This mode is designed to reproduce the output format of the original Perl COF CLI exactly, so it can be integrated with existing tools and test suites.
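
The reply format listed under the protocol commands can be parsed with a few lines of Python. The following is a minimal sketch and not part of the package: `build_cof_input` and `parse_cof_reply` are hypothetical helper names, the sample replies are illustrative, and it assumes every reply line has one of the forms listed above.

```python
def build_cof_input(commands: list[str]) -> str:
    """Join protocol commands into the newline-terminated text sent on stdin."""
    return "\n".join(commands) + "\n"


def parse_cof_reply(line: str) -> tuple[bool, list[str]]:
    """Parse one reply line: 'ok', 'no', or 'no<TAB>sug1,sug2,...'."""
    status, _, rest = line.rstrip("\r\n").partition("\t")
    if status not in ("ok", "no"):
        raise ValueError(f"unexpected reply: {line!r}")
    # Suggestions, when present, follow the tab as a comma-separated list.
    suggestions = [s for s in rest.split(",") if s]
    return status == "ok", suggestions


# Illustrative input and replies (not captured from a real run).
print(build_cof_input(["c preon", "s preo", "q"]), end="")
print(parse_cof_reply("ok\n"))
print(parse_cof_reply("no\tpreon,pren,predi\n"))
```

To drive the real CLI, the text returned by `build_cof_input` could be passed on stdin to `furlanspellchecker cof-cli`, for example with `subprocess.run(..., input=..., text=True, capture_output=True)`, and each line of the captured output handed to `parse_cof_reply`.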

#### Database Management

Download dictionary databases:
```bash
furlanspellchecker download-dicts
```

Check database status:
```bash
furlanspellchecker db-status
```

Extract local ZIP files:
```bash
furlanspellchecker extract-dicts /path/to/zipfile.zip
```

#### Standard Commands

Check a single word:
```bash
furlanspellchecker lookup "cjase"
```

Get suggestions for a misspelled word:
```bash
furlanspellchecker suggest "cjasa"
```

Check text from a file:
```bash
furlanspellchecker file input.txt -o corrected.txt
```

### Python API Usage

```python
import asyncio
from furlan_spellchecker import SpellCheckPipeline

# Initialize the spell checker
pipeline = SpellCheckPipeline()

# Check text
result = pipeline.check_text("Cheste e je une frâs in furlan.")
print(f"Incorrect words: {result['incorrect_count']}")

# Check a single word
async def check_word():
    word_result = await pipeline.check_word("furlan")
    print(f"'{word_result['word']}' is {'correct' if word_result['is_correct'] else 'incorrect'}")

asyncio.run(check_word())
```

## Architecture

FurlanSpellChecker is organized as a set of modular components:

| Module | Responsibility |
|--------|----------------|
| `core` | Abstract interfaces, exceptions, and type definitions |
| `entities` | Data structures for processed text elements |
| `spellchecker` | Main spell checking logic and text processing |
| `dictionary` | Dictionary management and RadixTree implementation |
| `database` | Database access, download management, and caching |
| `phonetic` | Friulian-specific phonetic algorithm |
| `services` | High-level pipeline and I/O services |
| `config` | Configuration schemas and management |
| `cli` | Command-line interface |
| `data` | Packaged dictionary data |

## Configuration

The spell checker can be configured through configuration files or programmatically:

```python
from furlan_spellchecker import FurlanSpellCheckerConfig, DictionaryConfig

config = FurlanSpellCheckerConfig(
    dictionary=DictionaryConfig(
        max_suggestions=5,
        use_phonetic_suggestions=True
    )
)
```

## Database Files

FurlanSpellChecker uses database files for dictionary lookups, word frequencies, elisions, and error corrections. These files are **automatically downloaded** from GitHub Releases on first use.

### Automatic Download

When you first use the spell checker, it automatically downloads the required database files (a download of about 63 MB) and caches them locally in:
- **Windows**: `C:\Users\<username>\.cache\furlan_spellchecker\databases`
- **Linux/Mac**: `~/.cache/furlan_spellchecker/databases`

No manual intervention required! 🎉

### Database Contents

| Database | Size | Description |
|----------|------|-------------|
| `words.sqlite` | ~289 MB | Phonetic dictionary (7.4M phonetic hashes, 10.1M words) |
| `frequencies.sqlite` | ~2 MB | Word frequency data (69,051 words) for suggestion ranking |
| `elisions.sqlite` | ~0.2 MB | Elision rules (10,604 words) |
| `errors.sqlite` | ~0.01 MB | Common error corrections (301 patterns) |
| `words_radix_tree.rt` | ~9.7 MB | RadixTree for fast word lookups |

**Total**: ~300 MB once extracted (SQLite + binary formats)
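
The on-disk format of `words_radix_tree.rt` is not described here. As rough intuition for why a prefix tree suits fast word lookups, the sketch below uses a plain, uncompressed trie. This is a simpler structure than a radix tree and is not the project's implementation; the `Trie` class and its methods are hypothetical names.

```python
class Trie:
    """Plain prefix tree: one dictionary level per character."""

    def __init__(self) -> None:
        self.children: dict[str, "Trie"] = {}
        self.is_word = False

    def insert(self, word: str) -> None:
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def _find(self, text: str) -> "Trie | None":
        """Walk down the tree along `text`; return None if the path is missing."""
        node = self
        for ch in text:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def contains(self, word: str) -> bool:
        node = self._find(word)
        return node is not None and node.is_word

    def with_prefix(self, prefix: str) -> list[str]:
        """Return all stored words that start with `prefix`, sorted."""
        start = self._find(prefix)
        if start is None:
            return []
        found, stack = [], [(start, prefix)]
        while stack:
            node, text = stack.pop()
            if node.is_word:
                found.append(text)
            stack.extend((child, text + ch) for ch, child in node.children.items())
        return sorted(found)


trie = Trie()
for w in ("preon", "pren", "predi", "cjase"):
    trie.insert(w)

print(trie.contains("cjase"))   # membership costs one step per character
print(trie.contains("cjasa"))
print(trie.with_prefix("pre"))
```

Lookup time depends on the length of the word rather than on the size of the dictionary, which is the property that makes tree-based lookups attractive for a word list of this size.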

### Manual Download (Optional)

If you prefer to download manually or work offline:

1. Download from: [Latest Database Release](https://github.com/daurmax/FurlanSpellChecker/releases/tag/0.0.2-dictionaries-sqlite)
2. Extract the ZIP files into the cache directory listed above
3. The spell checker will use the cached files
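
After a manual download, it can be useful to confirm that the expected files are in place. The following is a minimal sketch using only the standard library: the file names come from the table above, but the assumption that they sit directly in the cache directory, and the helper names `default_cache_dir` and `missing_databases`, are not taken from the package. `furlanspellchecker db-status` remains the supported way to check.

```python
from pathlib import Path

# File names from the "Database Contents" table; their exact location
# inside the cache directory is an assumption of this sketch.
EXPECTED_FILES = (
    "words.sqlite",
    "frequencies.sqlite",
    "elisions.sqlite",
    "errors.sqlite",
    "words_radix_tree.rt",
)


def default_cache_dir() -> Path:
    """Cache location documented above (the same layout on Windows, Linux, and Mac)."""
    return Path.home() / ".cache" / "furlan_spellchecker" / "databases"


def missing_databases(cache_dir: Path) -> list[str]:
    """Return the expected file names that are not present in cache_dir."""
    return [name for name in EXPECTED_FILES if not (cache_dir / name).is_file()]


missing = missing_databases(default_cache_dir())
if missing:
    print("Missing database files:", ", ".join(missing))
else:
    print("All expected database files are cached.")
```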

### For Contributors: Creating Database Releases

If you need to create a new database release (e.g., after updating word lists):

```bash
# Install dependencies
pip install PyGithub

# Set GitHub token (in PowerShell: $env:GITHUB_TOKEN = "your_token_here")
export GITHUB_TOKEN="your_token_here"

# Create release
python scripts/create_database_release.py --tag v1.1.0-databases
```

See [Database Release Guide](docs/development/GitHub_Release_Instructions.md) for detailed instructions.

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

### Development Setup

1. Clone the repository
2. Install development dependencies: `pip install -e ".[dev]"`
3. Run tests: `pytest`
   - Run specific test modules: `pytest tests/test_radix_tree.py -v`
   - Run performance tests: `pytest tests/test_radix_tree.py -m slow -v`
   - Skip slow tests: `pytest -m "not slow"`
4. Run linting: `ruff check src tests`
5. Run type checking: `mypy src`

### Test Suite

The project includes comprehensive test coverage with special focus on:

- **COF Compatibility**: RadixTree tests ensure 1:1 compatibility with the original COF implementation
- **Edge Case Testing**: Comprehensive handling of empty input, special characters, and invalid data
- **Performance Testing**: Batch processing and stress testing for production readiness
- **Integration Testing**: End-to-end testing with DatabaseManager and other components

**RadixTree Test Coverage** (24 tests total):
- **COF Compatibility** (13 tests): Core suggestion matching with verified test cases
- **Edge Cases** (7 tests): Friulian-specific character handling (cjàse, furlanâ, çi)
- **Performance** (2 tests): Batch processing and stress testing benchmarks
- **Integration** (2 tests): DatabaseManager integration and availability checks

## License

This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License - see the [LICENSE](LICENSE) file for details.

## Acknowledgments

- Based on the original C# implementation in CoretorOrtograficFurlan-Core
- Inspired by the architecture of FurlanG2P
- Dictionary data sourced from Friulian linguistic resources

## Related Projects

- [CoretorOrtograficFurlan-Core](https://github.com/daurmax/CoretorOrtograficFurlan-Core) - Original C# implementation
- [FurlanG2P](https://github.com/daurmax/FurlanG2P) - Friulian grapheme-to-phoneme conversion