Metadata-Version: 2.3
Name: mismo
Version: 0.3.1.dev5
Summary: The SQL/Ibis powered sklearn of record linkage.
Keywords: record linkage,entity resolution,fuzzy linking,machine learning,ibis,sql,splink,duckdb
Author: Nick Crews
Author-email: Nick Crews <nicholas.b.crews@gmail.com>
License: LGPL-3.0-or-later
Classifier: Development Status :: 3 - Alpha
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Programming Language :: Python :: Implementation :: PyPy
Requires-Dist: ibis-enum>=0.2.0
Requires-Dist: ibis-framework>=9.1.0
Requires-Dist: sqlglot>=25.29.0
Requires-Dist: typing-extensions>=4.0.0 ; python_full_version < '3.11'
Requires-Dist: scikit-learn>=1.5.2 ; extra == 'metrics'
Requires-Dist: postal>=1.1.7,<1.1.11 ; (sys_platform == 'darwin' and extra == 'postal') or (sys_platform == 'linux' and extra == 'postal')
Requires-Dist: en-us-address-ner-sm ; extra == 'spacy'
Requires-Dist: spacy>=3.8.2 ; extra == 'spacy'
Requires-Dist: altair>=5.0.0 ; extra == 'viz'
Requires-Dist: ipywidgets>=7.5.1 ; extra == 'viz'
Requires-Dist: solara-ui>=1.51.0 ; extra == 'viz'
Requires-Dist: anywidget>=0.9.18 ; extra == 'viz'
Requires-Python: >=3.10
Project-URL: Documentation, https://nickcrews.github.io/mismo
Project-URL: Homepage, https://github.com/NickCrews/mismo
Project-URL: Issues, https://github.com/NickCrews/mismo/issues
Project-URL: Source, https://github.com/NickCrews/mismo
Provides-Extra: metrics
Provides-Extra: postal
Provides-Extra: spacy
Provides-Extra: viz
Description-Content-Type: text/markdown

# Mismo

[![PyPI - Version](https://img.shields.io/pypi/v/mismo.svg)](https://pypi.org/project/mismo)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/mismo.svg)](https://pypi.org/project/mismo)

The SQL/Ibis powered sklearn of record linkage.

Mismo is still in alpha. Breaking changes will happen frequently
and without warning. Once things have stabilized, I
will come up with a stability policy. Any suggestions about
how you would like the API to look would be greatly appreciated.
I do use this in my work, so I at least do a decent job of
ensuring correctness.

-----

## Goals

Mismo tries to be the sklearn of record linkage, backed by the scalability
and power of SQL and [Ibis](https://ibis-project.org/). It is made of many small
data structures and functions, each with a well-defined and standard API
that allows them to be composed together and extended easily.
None of the other record linkage packages I have seen, such as
[Splink](https://github.com/moj-analytical-services/splink),
[Dedupe](https://www.github.com/dedupeio/dedupe), or
[Record Linkage Toolkit](https://github.com/J535D165/recordlinkage),
had all of these properties, so I decided to make my own.

See [Goals and Alternatives](https://nickcrews.github.io/mismo/concepts/goals_and_alternatives)
for a more detailed discussion of the goals of Mismo and how it compares to other
record linkage packages.

## Features
- Supports larger-than-memory datasets, executed on powerful SQL engines.
  Use DuckDB for prototyping and for jobs up to roughly 10M records,
  or Spark or other distributed backends for larger tasks, without
  needing to change your code!
- Use the clean, strongly typed, Pythonic dataframe APIs of [Ibis](https://ibis-project.org/).
- Small, modular functions and data structures that are easy to plug together
  and extend.
- Layered API: Use top-level APIs if your task is common enough that it is
  supported out of the box.

## Installation

[`mismo` is available on PyPI](https://pypi.org/project/mismo/).
I try to publish semver'ed releases after most changes.

If I forget to do this, then there are also [prereleases on PyPI](https://pypi.org/project/mismo/#history).
These are published every week by a GitHub Action using the HEAD commit of this repo.
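
As a rough sketch, installing from PyPI with `uv` might look like the following.
The extras shown (`viz`, `metrics`) are the ones declared in the package metadata;
the flags are ordinary `uv` options, not anything specific to `mismo`, and plain
`pip` should work as well (for example `pip install --pre mismo` for a prerelease).

```console
# Latest stable release, core dependencies only
uv pip install mismo

# Same, plus optional extras (here: visualization and metrics)
uv pip install "mismo[viz,metrics]"

# Allow the weekly prereleases to be selected
uv pip install --prerelease allow mismo
```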

You can also install directly from a branch or a specific commit on GitHub:

```console
uv pip install "mismo[viz] @ git+https://github.com/NickCrews/mismo@<SOME-SHA-OR-BRANCH>"
```

## Examples

See the [example notebook](https://nickcrews.github.io/mismo/examples/patent_deduplication).

## Documentation

See the [documentation](https://nickcrews.github.io/mismo).

## Contributing

See the [contributing guide](https://nickcrews.github.io/mismo/contributing/).

## License

`mismo` is distributed under the terms of the
[LGPL-3.0-or-later](https://spdx.org/licenses/LGPL-3.0-or-later.html) license.
