Metadata-Version: 2.3
Name: polars-strsim
Version: 0.2.2
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Programming Language :: Python :: Implementation :: PyPy
Classifier: Programming Language :: Rust
Requires-Dist: polars >=0.20.0
License-File: LICENSE
Summary: Polars extension for string similarity
Keywords: polars-extension,string-similarity
Author: Jeremy Foxcroft
Requires-Python: >=3.8
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
Project-URL: Repository, https://github.com/foxcroftjn/polars-strsim
Project-URL: Issues, https://github.com/foxcroftjn/polars-strsim/issues

<a href="https://pypi.org/project/polars-strsim/">
    <img src="https://img.shields.io/pypi/v/polars-strsim.svg" alt="PyPI Latest Release"/>
</a>

# String Similarity Measures for Polars

This package provides Python bindings for computing string similarity measures directly on a Polars DataFrame. All measures are implemented in Rust and computed in parallel.

The following similarity measures are implemented:

- Levenshtein
- Jaro
- Jaro-Winkler
- Jaccard
- Sørensen-Dice

Each similarity measure returns a value normalized between 0.0 and 1.0 (inclusive), where 0.0 means the two strings are maximally different and 1.0 means they are maximally similar.
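To make the normalization concrete, here is a minimal pure-Python sketch of a normalized Levenshtein similarity. It assumes the score is `1 - distance / max(len(a), len(b))` and that two empty strings score 1.0. Both assumptions are consistent with the example output in this README, but they are not taken from the Rust source, and the function name `levenshtein_similarity` is hypothetical.

```python
def levenshtein_similarity(a: str, b: str) -> float:
    """Normalized Levenshtein similarity (assumed formula, not the library's code)."""
    if not a and not b:
        return 1.0  # assumption: two empty strings count as identical

    # Classic dynamic-programming edit distance, keeping one row at a time.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current

    distance = previous[-1]
    # Divide by the longer length so the result always lands in [0.0, 1.0].
    return 1.0 - distance / max(len(a), len(b))


print(levenshtein_similarity("phillips", "philips"))  # 0.875
```

One deletion separates the two strings and the longer one has 8 characters, so the score is `1 - 1/8`. The sketch compares Python code points; how the library handles multi-byte characters is not covered here.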

## Installing the Library

### With pip

```bash
pip install polars-strsim
```

### From Source

To build and install this library from source, first ensure you have [cargo](https://doc.rust-lang.org/cargo/getting-started/installation.html) installed. You will also need maturin, which you can install via `pip install 'maturin[patchelf]'`.

polars-strsim can then be installed in your current Python environment by running `maturin develop --release`.

## Using the Library

**Input:**

```python
import polars as pl
from polars_strsim import levenshtein, jaro, jaro_winkler, jaccard, sorensen_dice

df = pl.DataFrame(
    {
        "name_a": ["phillips", "phillips", ""        , "", None      , None],
        "name_b": ["phillips", "philips" , "phillips", "", "phillips", None],
    }
).with_columns(
    levenshtein=levenshtein("name_a", "name_b"),
    jaro=jaro("name_a", "name_b"),
    jaro_winkler=jaro_winkler("name_a", "name_b"),
    jaccard=jaccard("name_a", "name_b"),
    sorensen_dice=sorensen_dice("name_a", "name_b"),
)

print(df)
```

**Output:**

```
shape: (6, 7)
┌──────────┬──────────┬─────────────┬──────────┬──────────────┬─────────┬───────────────┐
│ name_a   ┆ name_b   ┆ levenshtein ┆ jaro     ┆ jaro_winkler ┆ jaccard ┆ sorensen_dice │
│ ---      ┆ ---      ┆ ---         ┆ ---      ┆ ---          ┆ ---     ┆ ---           │
│ str      ┆ str      ┆ f64         ┆ f64      ┆ f64          ┆ f64     ┆ f64           │
╞══════════╪══════════╪═════════════╪══════════╪══════════════╪═════════╪═══════════════╡
│ phillips ┆ phillips ┆ 1.0         ┆ 1.0      ┆ 1.0          ┆ 1.0     ┆ 1.0           │
│ phillips ┆ philips  ┆ 0.875       ┆ 0.958333 ┆ 0.975        ┆ 0.875   ┆ 0.933333      │
│          ┆ phillips ┆ 0.0         ┆ 0.0      ┆ 0.0          ┆ 0.0     ┆ 0.0           │
│          ┆          ┆ 1.0         ┆ 1.0      ┆ 1.0          ┆ 1.0     ┆ 1.0           │
│ null     ┆ phillips ┆ null        ┆ null     ┆ null         ┆ null    ┆ null          │
│ null     ┆ null     ┆ null        ┆ null     ┆ null         ┆ null    ┆ null          │
└──────────┴──────────┴─────────────┴──────────┴──────────────┴─────────┴───────────────┘
```
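
The `jaccard` and `sorensen_dice` values in the output above can be reproduced by treating each string as a multiset (bag) of characters. The sketch below is an inference from those numbers, not a description of the Rust implementation: the helper names are hypothetical, and the treatment of two empty strings as a perfect match is assumed from the fourth row of the table.

```python
from collections import Counter


def shared_char_count(a: str, b: str) -> int:
    """Size of the multiset intersection of the characters in a and b."""
    return sum((Counter(a) & Counter(b)).values())


def jaccard_similarity(a: str, b: str) -> float:
    """Multiset Jaccard: |A ∩ B| / |A ∪ B| (assumed definition)."""
    if not a and not b:
        return 1.0  # assumption: two empty strings count as identical
    shared = shared_char_count(a, b)
    return shared / (len(a) + len(b) - shared)


def sorensen_dice_similarity(a: str, b: str) -> float:
    """Multiset Sørensen-Dice: 2 * |A ∩ B| / (|A| + |B|) (assumed definition)."""
    if not a and not b:
        return 1.0  # assumption: two empty strings count as identical
    return 2 * shared_char_count(a, b) / (len(a) + len(b))


print(jaccard_similarity("phillips", "philips"))                  # 0.875
print(round(sorensen_dice_similarity("phillips", "philips"), 6))  # 0.933333
```

"phillips" and "philips" share 7 characters when counted with multiplicity, giving `7 / 8` for Jaccard and `14 / 15` for Sørensen-Dice. A plain set-based Jaccard would give 1.0 for this pair, since both strings use the same five distinct letters, which is why the multiset reading is assumed here.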

