Metadata-Version: 2.1
Name: datarules
Version: 0.0.2
Summary: Rules for validating and correcting datasets
Author-email: lverweijen <lauwerund@gmail.com>
License: Apache License 2.0
Project-URL: Homepage, https://github.com/lverweijen/datarules
Project-URL: Repository, https://github.com/lverweijen/datarules
Project-URL: Issues, https://github.com/lverweijen/datarules/issues
Project-URL: Changelog, https://github.com/lverweijen/datarules/blob/main/changes.md
Keywords: rules,validation,checks,correction,data-editing,data-cleaning,data-cleansing
Classifier: Intended Audience :: Science/Research
Classifier: Operating System :: OS Independent
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Typing :: Typed
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas >=1.5.0
Provides-Extra: extras
Requires-Dist: pyaml ; extra == 'extras'

# datarules

## Goal and motivation

This project lets you define rules to validate and correct datasets.
Whenever possible, rules are evaluated in a vectorized way, which keeps the library fast.


Reasons to make this:
- Implement the whole data pipeline in a single language (Python).
  No need to use a subprocess or HTTP to send your data to R and back.
- Directly use pandas and all the other Python packages you are already familiar with. No need to relearn how everything is done in R.
- Validation can be fast if vectorized.

## Usage

This package provides two operations on data:

- checks (whether data is correct), also known as validations
- corrections (how to fix incorrect data)

### Checks

In `checks.py`:

```python
from datarules import check


@check(tags=["P1"])
def check_almost_square(width, height):
    return (width - height).abs() < 5


@check(tags=["P3", "completeness"])
def check_not_too_deep(depth):
    return depth < 3
```

In your main code:

```python
import pandas as pd
from datarules import load_checks, Runner

df = pd.DataFrame([
    {"width": 3, "height": 7},
    {"width": 3, "height": 5, "depth": 1},
    {"width": 3, "height": 8},
    {"width": 3, "height": 3},
    {"width": 3, "height": -2, "depth": 4},
])

checks = load_checks('checks.py')
report = Runner().check(df, checks)
print(report)
```

Output:
```
                  name                           condition  items  passes  fails  NAs error  warnings
0  check_almost_square  check_almost_square(width, height)      5       3      2    0  None         0
1   check_not_too_deep           check_not_too_deep(depth)      5       1      4    0  None         0

```
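
To see where the counts in this report come from, the two conditions can be evaluated by hand with plain pandas. This is only a sketch of the vectorized idea, not how `datarules` is implemented internally; the variable names `almost_square` and `not_too_deep` are made up for illustration. Note that rows without a `depth` value count as fails rather than NAs here, because pandas evaluates a comparison against `NaN` as `False`.

```python
import pandas as pd

df = pd.DataFrame([
    {"width": 3, "height": 7},
    {"width": 3, "height": 5, "depth": 1},
    {"width": 3, "height": 8},
    {"width": 3, "height": 3},
    {"width": 3, "height": -2, "depth": 4},
])

# Same expressions as the bodies of the two checks, applied to whole columns.
almost_square = (df["width"] - df["height"]).abs() < 5
not_too_deep = df["depth"] < 3  # missing depth is NaN, and NaN < 3 is False

for name, result in [("check_almost_square", almost_square),
                     ("check_not_too_deep", not_too_deep)]:
    passes = int(result.sum())
    fails = int((~result).sum())
    print(name, len(result), passes, fails)
```

Each condition is computed once for the whole column instead of once per row, which is what makes this style of check fast.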

### Corrections

In `corrections.py`:

```python
from datarules import correction
from checks import check_almost_square


@correction(condition=check_almost_square.fails)
def make_square(width, height):
    return {"height": height + (width - height) / 2}
```

In your main code:

```python
from datarules import load_corrections, Runner

# `df` is the DataFrame defined in the checks example.

corrections = load_corrections('corrections.py')
report = Runner().correct(df, corrections)
print(report)
```

Output:
```
          name                                 condition                      action  applied error  warnings
0  make_square  check_almost_square.fails(width, height)  make_square(width, height)        2  None         0
```
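
The effect of `make_square` on the example data can be worked out by hand with plain pandas. The sketch below assumes the action is applied only to rows where `check_almost_square` fails, which is what the `condition` argument suggests. It does not use `datarules`, and the names `fails`, `corrected` and `result` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame([
    {"width": 3, "height": 7},
    {"width": 3, "height": 5, "depth": 1},
    {"width": 3, "height": 8},
    {"width": 3, "height": 3},
    {"width": 3, "height": -2, "depth": 4},
])

# Rows where the check fails (assumed to be the rows the correction targets).
fails = ~((df["width"] - df["height"]).abs() < 5)

# The action from make_square: move height halfway towards width.
corrected = df["height"] + (df["width"] - df["height"]) / 2

# Keep the corrected value where the check failed, the original height elsewhere.
result = df.assign(height=corrected.where(fails, df["height"]))

print(int(fails.sum()))           # number of rows the correction applies to
print(result["height"].tolist())
```

Two rows fail the check, matching the `applied` column of the report. Their heights change from 8 to 5.5 and from -2 to 0.5, while the other rows are left unchanged.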

## Similar work (python)

These work on pandas:

- [Pandera](https://pandera.readthedocs.io/en/stable/index.html) - A good alternative for validation only. Like this package, its checks are vectorized.
- [Pandantic](https://github.com/wesselhuising/pandantic) - A combination of validation and parsing based on [pydantic](https://docs.pydantic.dev/latest/).

The following mostly offer validation only, and none of them seems to be vectorized or to support pandas directly.

- [Great Expectations](https://github.com/great-expectations/great_expectations) - An overengineered library for validation that has confusing documentation.
- [contessa](https://github.com/kiwicom/contessa) - Meant to be used against databases.
- [validator](https://github.com/CSenshi/Validator)
- [python-valid8](https://github.com/smarie/python-valid8)
- [pyruler](https://github.com/danteay/pyruler) - Dead project that is rule-based.
- [pyrules](https://github.com/miraculixx/pyrules) - Dead project for corrections.

## Similar work (R)

This project is inspired by https://github.com/data-cleaning/.
Similar functionality can be found in the following R packages:

- [validate](https://github.com/data-cleaning/validate)
- [dcmodify](https://github.com/data-cleaning/dcmodify)
- [errorlocate](https://github.com/data-cleaning/errorlocate)
- [deductive](https://github.com/data-cleaning/deductive)

Features found in one of the packages above but not implemented here might eventually make it into this package too.
