Metadata-Version: 2.1
Name: oagdedupe
Version: 0.1.0
Summary: oagdedupe is a Python library for scalable entity resolution, using active learning to learn blocking configurations, generate comparison pairs, then classify matches.
Home-page: https://github.com/chansooligans/oagdedupe
Keywords: dedupe,entity resolution,record linkage,blocking
Author: Chansoo Song
Requires-Python: >=3.8,<3.11
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Provides-Extra: book
Requires-Dist: Faker (>=13.15.1,<14.0.0)
Requires-Dist: SQLAlchemy (>=1.4.39,<2.0.0)
Requires-Dist: Sphinx (>=5.1.1,<6.0.0,!=5.2.0.post0)
Requires-Dist: autodocsumm (>=0.2.9,<0.3.0); extra == "book"
Requires-Dist: diagrams (>=0.21.1,<0.22.0)
Requires-Dist: fastapi[all] (>=0.79.0,<0.80.0)
Requires-Dist: flake8 (>=4.0.1,<5.0.0)
Requires-Dist: graphviz (>=0.19.0,<0.20.0)
Requires-Dist: ipykernel (>=6.13.0,<7.0.0)
Requires-Dist: jellyfish (>=0.9.0,<0.10.0)
Requires-Dist: jupytext (>=1.14.1,<2.0.0); extra == "book"
Requires-Dist: matplotlib (>=3.5.1,<4.0.0)
Requires-Dist: modAL (>=0.4.1,<0.5.0)
Requires-Dist: myst-parser (>=0.18.0,<0.19.0)
Requires-Dist: nbconvert (>=6.5.1,<7.0.0)
Requires-Dist: networkx (>=2.8,<3.0)
Requires-Dist: numpy (>=1.22.1,<2.0.0)
Requires-Dist: pandas (>=1.4.2,<2.0.0)
Requires-Dist: pathos (>=0.2.9,<0.3.0)
Requires-Dist: protobuf (>=3.20.2,<4.0.0)
Requires-Dist: psycopg2-binary (>=2.9.3,<3.0.0)
Requires-Dist: pydantic (>=1.10.2,<2.0.0)
Requires-Dist: pytest (>=7.1.2,<8.0.0)
Requires-Dist: ray (>=1.13.0,<2.0.0)
Requires-Dist: scikit-learn (>=1.0.2,<2.0.0)
Requires-Dist: seaborn (>=0.11.2,<0.12.0)
Requires-Dist: sphinx-rtd-theme (>=1.0.0,<2.0.0)
Requires-Dist: streamlit (>=1.11.1,<2.0.0)
Requires-Dist: streamlit-aggrid (>=0.2.3,<0.3.0)
Requires-Dist: tqdm (>=4.58.0,<5.0.0)
Project-URL: Documentation, https://deduper.readthedocs.io/en/latest/
Project-URL: Repository, https://github.com/chansooligans/oagdedupe
Description-Content-Type: text/markdown

# oagdedupe  

oagdedupe is a Python library for scalable entity resolution, using active
learning to learn blocking configurations, generate comparison pairs,
then classify matches.

## page contents
- [Documentation](#documentation)
- [Installation](#installation)
    - [label-studio](#label-studio)
    - [postgres](#postgres)
    - [project settings](#project-settings)
- [dedupe](#dedupe-example)
- [record-linkage](#record-linkage-example)
    
# Documentation<a name="#documentation"></a>

The documentation for oagdedupe is at https://deduper.readthedocs.io/en/latest/.
It includes the [api reference](https://deduper.readthedocs.io/en/latest/dedupe/api.html),
a [guide to methodology](https://deduper.readthedocs.io/en/latest/userguide/intro.html),
and [examples](https://deduper.readthedocs.io/en/latest/examples/example_dedupe.html).

# Installation<a name="#installation"></a>

[tbd pip install instructions]

## start label-studio<a name="#label-studio"></a>

Start label-studio with the docker command below, replacing `[LS_PORT]` with
the port to use on your host machine:

```sh
docker run -it -p [LS_PORT]:8080 -v `pwd`/cache/mydata:/label-studio/data \
	--env LABEL_STUDIO_LOCAL_FILES_SERVING_ENABLED=true \
	--env LABEL_STUDIO_LOCAL_FILES_DOCUMENT_ROOT=/label-studio/files \
	-v `pwd`/cache/myfiles:/label-studio/files \
	heartexlabs/label-studio:latest label-studio
```
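
Once the container is running, it can be handy to confirm that the chosen host port accepts connections before filling in the project settings. The following is a minimal sketch using only the standard library; `port_is_open` is a hypothetical helper, not part of oagdedupe, and it only checks that something is listening on the port, not that the listener is label-studio.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # connection refused, timed out, or host unreachable
        return False


# Example call; 8089 stands in for whatever was substituted for [LS_PORT]:
# port_is_open("localhost", 8089)
```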

## postgres<a name="#postgres"></a>

[insert instructions here about initializing postgres]

Most importantly, you need to create the functions defined in `dedupe/postgres/funcs.py`.

## project settings<a name="#project-settings"></a>

Make an `oagdedupe.settings.Settings` object. For example:
```py
from oagdedupe.settings import (
    Settings,
    SettingsOther,
)

settings = Settings(
    name="default",  # the name of the project, a unique identifier
    folder="./.dedupe",  # path to folder where settings and data will be saved
    other=SettingsOther(
        n=5000,  # active-learning samples per learning loop
        k=3,  # max_len of block conjunctions
        cpus=20,  # parallelize distance computations
        attributes=["givenname", "surname", "suburb", "postcode"],  # list of entity attribute names
        path_database="postgresql+psycopg2://username:password@172.22.39.26:8000/db",  # connection string for the postgres database holding intermediate data
        db_schema="dedupe",
        path_model="./.dedupe/test_model",  # where to save the model
        label_studio={
            "port": 8089,  # label studio port
            "api_key": "83e2bc3da92741aa41c272829558c596faefa745",  # label studio api key
            "description": "chansoo test project",  # label studio description of project
        },
        fast_api={"port": 8090},  # fast api port
    ),
)
settings.save()
```
To get the label studio api_key:
   1. Log in (you can make up any username and password).
   2. Go to "Account & Settings" using the icon at the top right.
   3. Copy the Access Token and paste it into settings at `settings.other.label_studio["api_key"]`.
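
Rather than pasting the token into source code, one option is to read it from the environment when building the `label_studio` dict. This is a sketch under stated assumptions: `LABEL_STUDIO_API_KEY` is a variable name invented for this example and `label_studio_config` is a hypothetical helper; only the dict keys (`port`, `api_key`, `description`) come from the settings example above.

```python
import os


def label_studio_config(port: int, description: str) -> dict:
    """Build the label_studio settings dict, reading the token from the environment."""
    # LABEL_STUDIO_API_KEY is a hypothetical variable name chosen for this sketch
    api_key = os.environ.get("LABEL_STUDIO_API_KEY")
    if not api_key:
        raise RuntimeError(
            "set LABEL_STUDIO_API_KEY to your label studio access token"
        )
    return {"port": port, "api_key": api_key, "description": description}
```

The returned dict could then be passed as `label_studio=label_studio_config(8089, "my project")` when constructing `SettingsOther`.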

See [dedupe/settings.py](./dedupe/settings.py) for the full settings code.
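
The `path_database` value is a SQLAlchemy-style URL, so a username or password containing characters such as `@`, `:` or `/` has to be URL-encoded before it is embedded. The sketch below shows one way to assemble such a string with the standard library; `build_database_url` is a hypothetical helper and the host, port and credentials in the usage comment are placeholders.

```python
from urllib.parse import quote_plus


def build_database_url(
    user: str, password: str, host: str, port: int, database: str
) -> str:
    """Assemble a postgresql+psycopg2 URL with URL-encoded credentials."""
    return (
        f"postgresql+psycopg2://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Placeholder values for illustration only:
# build_database_url("username", "p@ss:word/1", "localhost", 5432, "db")
```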

# dedupe<a name="#dedupe-example"></a>

Below is an example that dedupes `df` on the attribute columns specified in settings.

## train dedupe<a name="#train-dedupe"></a>

```py
import pandas as pd
from oagdedupe.api import Dedupe

# `df` is a pandas DataFrame containing the attribute columns listed in settings
d = Dedupe(settings=settings)
d.initialize(df=df, reset=True)

# %%
# pre-processes data and stores pre-processed data, comparisons, ID matrices in the postgres database
d.fit_blocks()
```

# record-linkage<a name="#record-linkage-example"></a>

Below is an example that links `df` to `df2` on the attribute columns specified
in settings (both dataframes should contain these columns).

## train record-linkage<a name="#train-record-linkage"></a>
```py
import pandas as pd
from oagdedupe.api import RecordLinkage

# `df` and `df2` are pandas DataFrames that both contain the attribute columns listed in settings
d = RecordLinkage(settings=settings)
d.initialize(df=df, df2=df2, reset=True)

# %%
# pre-processes data and stores pre-processed data, comparisons, ID matrices in the postgres database
d.fit_blocks()
```
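
Because both dataframes need the attribute columns, a quick check before `initialize` can catch a missing or misspelled column early. The following is a minimal sketch; `missing_attributes` is a hypothetical helper, not part of oagdedupe, and it works on plain column-name iterables so it does not depend on pandas.

```python
from typing import Dict, Iterable, List


def missing_attributes(
    attributes: Iterable[str], **frames: Iterable[str]
) -> Dict[str, List[str]]:
    """Map each frame name to the attributes absent from its columns.

    Frames that contain every attribute are left out of the result,
    so an empty dict means all frames are usable.
    """
    attributes = list(attributes)
    report: Dict[str, List[str]] = {}
    for name, columns in frames.items():
        present = set(columns)
        missing = [a for a in attributes if a not in present]
        if missing:
            report[name] = missing
    return report


# Assumed usage with the objects from the example above:
# missing_attributes(settings.other.attributes, df=df.columns, df2=df2.columns)
```

The same call with a single keyword argument covers the dedupe case.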

# active learn<a name="#active-learn"></a>

For either dedupe or record-linkage, run:

```sh
   export DEDUPER_NAME="<project name>"
   export DEDUPER_FOLDER="<project folder>"
   python -m dedupe.fastapi.main
```

Replace `<project name>` and `<project folder>` with your project settings (for the settings example above, `default` and `./.dedupe`).
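
The two variables have to be present in the environment of the server process itself. When launching from Python instead of a shell, one way to guarantee that is to pass an explicit environment to `subprocess`. This is a sketch, assuming the server reads `DEDUPER_NAME` and `DEDUPER_FOLDER` as the shell snippet suggests; `project_env` is a hypothetical helper.

```python
import os
import subprocess
import sys
from typing import Dict


def project_env(name: str, folder: str) -> Dict[str, str]:
    """Copy the current environment and add the two project variables."""
    env = dict(os.environ)
    env["DEDUPER_NAME"] = name
    env["DEDUPER_FOLDER"] = folder
    return env


# Assumed usage, mirroring the shell command (blocks until the server exits):
# subprocess.run(
#     [sys.executable, "-m", "dedupe.fastapi.main"],
#     env=project_env("<project name>", "<project folder>"),
#     check=True,
# )
```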

Then return to label-studio and start labelling. When the queue falls below 5 tasks, FastAPI updates the model with the labelled samples and then sends more tasks for review.
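
The refill behaviour described above can be pictured as a simple threshold check. The sketch below is illustrative only and is not the library's implementation: `refill_if_low`, `update_model` and `next_tasks` are hypothetical names, and the only detail taken from the text is the threshold of 5 tasks.

```python
from typing import Callable, List

QUEUE_THRESHOLD = 5  # from the text: refill when fewer than 5 tasks remain


def refill_if_low(
    queue: List[dict],
    labelled: List[dict],
    update_model: Callable[[List[dict]], None],
    next_tasks: Callable[[], List[dict]],
) -> bool:
    """Retrain on labelled samples and top up the queue when it runs low.

    Returns True if a refill happened, False if the queue was long enough.
    """
    if len(queue) >= QUEUE_THRESHOLD:
        return False
    update_model(labelled)      # hypothetical: refit the model on the labels
    queue.extend(next_tasks())  # hypothetical: sample new pairs to review
    return True
```

Retraining only when the queue runs low keeps the labeller busy while avoiding a model update after every single label.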

# predictions<a name="#predictions"></a>

To get predictions, run the `predict()` method.

Dedupe:
```py
d = Dedupe(settings=Settings(name="test", folder="./.dedupe"))
d.predict()
```

Record-linkage:
```py
d = RecordLinkage(settings=Settings(name="test", folder="./.dedupe"))
d.predict()
```
