Metadata-Version: 2.4
Name: columnmatch
Version: 0.1.0
Summary: DataFrame-first fuzzy matching with character n-grams and cosine similarity.
Author-email: Mathew <mjacobconnect@gmail.com>
License-Expression: MIT
Project-URL: Homepage, https://github.com/mathewgit/columnmatch
Project-URL: Repository, https://github.com/mathewgit/columnmatch
Project-URL: Issues, https://github.com/mathewgit/columnmatch/issues
Keywords: fuzzy matching,pandas,cosine similarity,n-grams,record linkage
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas>=1.5
Provides-Extra: dev
Requires-Dist: pytest>=7.4; extra == "dev"
Requires-Dist: build>=1.2; extra == "dev"
Requires-Dist: twine>=5.1; extra == "dev"
Dynamic: license-file

# matchframe

DataFrame-first fuzzy matching for Python using **character n-grams** and **cosine similarity**, implemented from scratch (no RapidFuzz or external fuzzy libraries).

## Why this package exists

Many teams need practical matching between messy text columns in spreadsheets and data pipelines. `matchframe` focuses on:

- simple API for analysts and data engineers
- transparent scoring math
- clean pandas DataFrame input/output
- easy-to-read, production-ready Python code

## Features

- Accepts one DataFrame with two columns to compare
- Uses configurable character n-grams (`ngram_size`)
- Computes cosine similarity via sparse frequency vectors (`Counter`)
- Returns top matches with score %, rank, and decision label
- Friendly input validation and clear error messages

## Installation

### From source (current project)

```bash
pip install -e .
```

### After publishing

```bash
pip install columnmatch
```

## Quick usage

```python
import pandas as pd
from matchframe import match_columns

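# DataFrame columns must have equal length, so the shorter column is padded
# with None (this assumes match_columns ignores missing values).
df = pd.DataFrame(
    {
        "column1": ["apple ltd", None, None],
        "column2": ["apple limited", "appel ltd", "microsoft uk"],
    }
)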

result = match_columns(
    df,
    left_col="column1",
    right_col="column2",
    ngram_size=2,
    top_k=3,
    min_score=70,
    match_threshold=90,
    review_threshold=75,
)

print(result)
```

## Main function

`match_columns(...)` signature:

```python
match_columns(
    df,
    left_col="column1",
    right_col="column2",
    ngram_size=3,
    top_k=3,
    min_score=70,
    match_threshold=90,
    review_threshold=75,
    preprocess=True,
    remove_accents=True,
    keep_original=True,
)
```

### Parameters

- `df`: input pandas DataFrame
- `left_col`: source column to match from
- `right_col`: candidate column to match against
- `ngram_size`: character n-gram size (commonly 2 or 3)
- `top_k`: maximum number of matches returned per left value
- `min_score`: minimum similarity percentage (0-100) a match needs to be kept
- `match_threshold`: scores at or above this value are labelled `match`
- `review_threshold`: scores at or above this value but below `match_threshold` are labelled `review`
- `preprocess`: whether to clean text before matching
- `remove_accents`: remove accents during preprocessing
- `keep_original`: keep original values in output (`True`) or processed values (`False`)
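
To make the interaction between `top_k`, `min_score`, and the two thresholds concrete, here is a minimal sketch of how scored candidates could be filtered, ranked, and labelled. It is not the package's implementation: the helper name `rank_and_label` and the `no_match` label for scores below `review_threshold` are assumptions made for illustration, and the scores are made-up inputs.

```python
from typing import Dict, List, Tuple


def rank_and_label(
    scores: Dict[str, float],
    top_k: int = 3,
    min_score: float = 70,
    match_threshold: float = 90,
    review_threshold: float = 75,
) -> List[Tuple[str, float, int, str]]:
    """Hypothetical helper: filter, rank, and label candidate scores (0-100)."""
    # Keep only candidates at or above min_score.
    kept = [(cand, s) for cand, s in scores.items() if s >= min_score]
    # Highest score first, then keep at most top_k candidates.
    kept.sort(key=lambda pair: pair[1], reverse=True)
    rows = []
    for rank, (cand, s) in enumerate(kept[:top_k], start=1):
        if s >= match_threshold:
            decision = "match"
        elif s >= review_threshold:
            decision = "review"
        else:
            # Assumed label; the README does not name this case.
            decision = "no_match"
        rows.append((cand, s, rank, decision))
    return rows


# Made-up scores for one left value, for illustration only.
example = {"candidate a": 95.0, "candidate b": 80.0, "candidate c": 72.0, "candidate d": 40.0}
for row in rank_and_label(example):
    print(row)
```

Under these assumed semantics and the default values, a score between `min_score` (70) and `review_threshold` (75) is kept in the output but earns neither label, which is why the sketch needs a third one.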

## Output columns

The result DataFrame includes:

- `left_value`
- `right_value`
- `match_score`
- `rank`
- `decision`

If preprocessing is enabled, it also includes:

- `left_processed`
- `right_processed`
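
As a rough idea of what the processed columns might contain, the sketch below lowercases, trims, strips punctuation, and optionally removes accents using only the standard library. The function name `preprocess_text` and the exact cleaning rules are assumptions; the package's own preprocessing may differ.

```python
import re
import unicodedata


def preprocess_text(value: str, remove_accents: bool = True) -> str:
    """Hypothetical cleaner: lowercase, strip accents and punctuation, collapse spaces."""
    text = value.lower().strip()
    if remove_accents:
        # Decompose accented characters, then drop the combining marks.
        decomposed = unicodedata.normalize("NFKD", text)
        text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace anything that is not a word character or whitespace with a space.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse runs of whitespace into a single space.
    return re.sub(r"\s+", " ", text).strip()


print(preprocess_text("  Café-Crème, Ltd. "))  # cafe creme ltd
```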

## Sample input/output

Input values:

- left: `apple ltd`
- right candidates: `apple limited`, `appel ltd`, `microsoft uk`

Typical output rows, shown as `left_value | right_value | match_score | rank | decision`:

- `apple ltd | appel ltd | 91.4 | 1 | match`
- `apple ltd | apple limited | 84.2 | 2 | review`

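`microsoft uk` does not appear because its score falls below `min_score`. The scores shown are illustrative; exact numbers depend on `ngram_size` and preprocessing settings.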

## How scoring works

1. Preprocess text (lowercase, trim, punctuation removal, etc.)
2. Convert each string to character n-grams
3. Count n-gram frequencies using `collections.Counter`
4. Compute cosine similarity between the two n-gram count vectors `x` and `y`:

\[
\text{cosine\_similarity}(x, y) = \frac{x \cdot y}{\|x\|\|y\|}
\]

5. Convert to percentage:

\[
\text{match\_score} = 100 \times \text{cosine\_similarity}
\]
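
The steps above can be sketched in a few lines of standard-library Python. This illustrates character n-gram cosine similarity in general, not the package's source: the function names `char_ngrams` and `cosine_score` are hypothetical, and no padding or preprocessing is applied, so scores from `match_columns` may differ.

```python
import math
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams of length n (hypothetical helper)."""
    return Counter(text[i : i + n] for i in range(len(text) - n + 1))


def cosine_score(left: str, right: str, n: int = 3) -> float:
    """Return cosine similarity of n-gram counts as a percentage (0-100)."""
    x, y = char_ngrams(left, n), char_ngrams(right, n)
    # Dot product over n-grams of x; a Counter returns 0 for missing keys.
    dot = sum(count * y[gram] for gram, count in x.items())
    norm_x = math.sqrt(sum(c * c for c in x.values()))
    norm_y = math.sqrt(sum(c * c for c in y.values()))
    if norm_x == 0 or norm_y == 0:
        # Strings shorter than n have no n-grams; treat as no similarity.
        return 0.0
    # Clamp to guard against floating-point rounding slightly above 100.
    return min(100.0, 100.0 * dot / (norm_x * norm_y))


print(round(cosine_score("abcd", "abcd", n=2), 1))  # 100.0
print(round(cosine_score("abcd", "abce", n=2), 1))  # 66.7
print(round(cosine_score("abcd", "wxyz", n=2), 1))  # 0.0
```

In the second call, `abcd` and `abce` each produce three bigrams and share two of them (`ab`, `bc`), so the cosine is 2 / (√3 · √3) = 2/3.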

## Development setup

```bash
python -m venv .venv
# Windows PowerShell
.\.venv\Scripts\Activate.ps1
# macOS/Linux (use instead of the PowerShell line above)
# source .venv/bin/activate
pip install -e ".[dev]"
pytest
python examples/basic_example.py
python -m build

## Roadmap

- blocking strategies for large datasets
- optional parallel scoring
- optional export helpers and richer diagnostics
- optional configurable preprocessing profiles

## Contributing

1. Fork the repo and create a feature branch.
2. Add or update tests for your change.
3. Run `pytest` locally.
4. Open a pull request with a clear description.

## License

MIT
