Metadata-Version: 2.1
Name: deirokay
Version: 0.7.1
Summary: A tool for data profiling and data validation
Home-page: http://gitlab.bigdata/bressanmarcos/deirokay
Author: Marcos Bressan
Author-email: marcos.bressan@bigdata.com.br
License: UNKNOWN
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: Operating System :: OS Independent
Requires-Python: >=3.6
Description-Content-Type: text/markdown
Requires-Dist: Jinja2 (~=3.0)
Requires-Dist: jq (~=1.2)
Requires-Dist: pandas (~=1.4)
Provides-Extra: dev
Requires-Dist: SQLAlchemy ; extra == 'dev'
Requires-Dist: apache-airflow ; extra == 'dev'
Requires-Dist: boto3 ; extra == 'dev'
Requires-Dist: psycopg2-binary ; extra == 'dev'
Requires-Dist: pyarrow ; extra == 'dev'
Requires-Dist: pytest ; extra == 'dev'
Requires-Dist: pytest-cov ; extra == 'dev'
Requires-Dist: pytest-postgresql ; extra == 'dev'
Requires-Dist: s3fs ; extra == 'dev'
Provides-Extra: s3
Requires-Dist: boto3 ; extra == 's3'
Requires-Dist: s3fs ; extra == 's3'

# Deirokay
`master`:
![pipeline status](http://gitlab.bigdata/data-engineers/deirokay/badges/master/pipeline.svg)
![test coverage](http://gitlab.bigdata/data-engineers/deirokay/badges/master/coverage.svg)
![docstr-coverage](http://gitlab.bigdata/data-engineers/deirokay/-/jobs/artifacts/master/raw/badge.svg?job=docstr-coverage)

`dev`:
![pipeline status](http://gitlab.bigdata/data-engineers/deirokay/badges/dev/pipeline.svg)
![test coverage](http://gitlab.bigdata/data-engineers/deirokay/badges/dev/coverage.svg)
![docstr-coverage](http://gitlab.bigdata/data-engineers/deirokay/-/jobs/artifacts/dev/raw/badge.svg?job=docstr-coverage)


Deirokay (*dejɾo'kaj*) is a tool for data profiling and data validation.

Deirokay separates document parsing from validation logic,
so that you can create your statements about your data
without worrying whether or not your file has been properly
parsed.

You can use Deirokay for:
- Data parsing from files (CSV, parquet, excel, or any other
pandas-compatible format);
- Data validation, via *Deirokay Statements*;
- Data profiling, which generates Deirokay Statements automatically
based on an existing file. You may use these statements later against
new documents to make sure the validation still holds for new data.

To start using Deirokay, install its package and follow
the instructions bellow.


## Installation

Install Deirokay directly from `master` branch typing in your
command line:

`pip install git+http://gitlab.bigdata/data-engineers/deirokay`

To include optional dependences for AWS S3, install:

`pip install git+http://gitlab.bigdata/data-engineers/deirokay[s3]`

If you want to be in sync with the latest (and possibly unstable) release:

`pip install git+http://gitlab.bigdata/data-engineers/deirokay@dev`

## Installation for development

If you wish to contribute for Deirokay development, maybe
you will want to install a more complete set of packages for
testing and to help you in your development.

`pip install git+http://gitlab.bigdata/data-engineers/deirokay[dev]`

## API Reference

Please, [read the docs](http://data-engineers.docs.bigdata/deirokay).

## Getting started

Suppose the following CSV file:

name | age | is_married
---|---|---
john | 55 | true
mariah | 44 |
carl | | false

A known issue when dealing with `pandas` is that some of the
datatypes can be misrepresented as another dtypes. See the
result when reading this file with `pandas`:
```
>>> import pandas
>>> pandas.read_csv('file.csv')
     name   age is_married
0    john  55.0       True
1  mariah  44.0        NaN
2    carl   NaN      False
>>> pandas.read_csv('file.csv').dtypes
name           object
age           float64
is_married     object
dtype: object
```

Although strings are correctly parsed, integers become floats
when null cell are present in the same column.
If you ever write the file back to your hard disk, you'll get
the following content:

name | age | is_married
---|---|---
john | 55.0 | True
mariah | 44.0 |
carl | | False


Now all your integers became floats!

To prevent this unexpected behavior, Deirokay requires you to
provide a JSON (or YAML) file that explicitly specifies how each column
should be parsed:

``` JSON
{
    "columns": {
        "name": {
            "dtype": "string",
            "nullable": false,
            "unique": true
        },
        "age": {
            "dtype": "integer",
            "nullable": true,
            "unique": false
        },
        "is_married": {
            "dtype": "boolean",
            "nullable": true,
            "unique": false
        }
    }
}
```

Now, import `Deirokay.data_reader` and pass the JSON/YAML file
as argument:
```
>>> from deirokay import data_reader
>>> data_reader('file.csv', options='options.json')
     name   age  is_married
0    john    55        True
1  mariah    44        <NA>
2    carl  <NA>       False
>>> data_reader('file.csv', options='options.json').dtypes
name           object
age             Int64
is_married    boolean
dtype: object
```

The `options` argument also accepts `dict` objects directly.
When parsing your file, you may also provide a set of different
arguments, which varies in function of the data types. When
passing Deirokay file options as `dict`, you may optionally import
the available data types from the `deirokay.enums.DTypes` enumeration
class.

Below you will find a list of current data types and their
supported arguments.

DTypes     | String-like alias | Supported Arguments | Default             | Argument Description
:----------|:-----------------:|:-------------------:|:-------------------:|:--------------------------------------
All DTypes | -                 | nullable            | True                | Values can be null
All DTypes | -                 | unique              | False               | Values shoud be unique
All DTypes | -                 | rename              | None                | Rename column
INTEGER    | 'integer'         | thousand_sep        | None                | Thousand separator (e.g., "1,988")
FLOAT      | 'float'           | thousand_sep        | None                | Thousand separator (e.g., "1,988")
FLOAT      | 'float'           | decimal_sep         | '.'                 | Decimal separator (e.g., "3.14")
DECIMAL    | 'decimal'         | decimal_sep         | '.'                 | Decimal separator (e.g., "3.14")
DECIMAL    | 'decimal'         | thousand_sep        | None                | Thousand separator (e.g., "1,988")
DECIMAL    | 'decimal'         | decimal_places      | None                | Decimal places (e.g., 2 for "1.25")
BOOLEAN    | 'boolean'         | truthies            | ['true', 'True']    | Values taken as True
BOOLEAN    | 'boolean'         | falsies             | ['false', 'False']  | Values taken as False
BOOLEAN    | 'boolean'         | ignore_case         | False               | Ignore case when evaluating True/False
BOOLEAN    | 'boolean'         | default_value       | None                | Value to use if not truthy nor falsy
DATETIME   | 'datetime'        | format              | '%Y-%m-%d %H:%M:%S' | Date Time format
TIME       | 'time'            | format              | '%H:%M:%S'          | Time format
DATE       | 'date'            | format              | '%Y-%m-%d'          | Date format
STRING     | 'string'          | treat_null_as       | None                | Value to replace when null (e.g., "")

Along with the specification for the columns, Deirokay options also
support specification of parameters to properly open the file.
These extra parameters are passed to `pandas.read_*` methods when
reading it. For instance, you may want to specify the separator and
the encoding for your CSV file:
``` JSON
{
    "sep": ";",
    "encoding": "iso-8859-1",
    "columns": {
        "customer_id": {
            "dtype": "integer",
            "nullable": false,
            "thousand_sep": ".",
            "unique": false
        },
        "transaction_date": {
            "dtype": "datetime",
            "format": "%Y%m%d"
        },
        "transaction_id": {
            "dtype": "integer",
            "nullable": false,
            "thousand_sep": ".",
            "unique": false
        }
    }
} 
```

Except for `columns`, all other parameters from Deirokay options are
passed to `pandas.read_*` methods when opening the file.


### Making Statements about your data

The main entity in Deirokay is called Statement. A Statement is
a form of test that is executed against your data to determine
whether it proves to be Right/True or Wrong/False. A Statement is
always evaluated against a scope, i.e., a column or a set of columns.

A set of Statements are packed together in a Validation Document -
which can be either a JSON file, an YAML file or a Python dict - and
used as a declarative language to test the properties of your data.

Here is an example of Validation Document in JSON format:
``` JSON
{
    "name": "VENDAS",
    "description": "An optional field to provide further textual information",
    "items": [
        {
            "scope": [
                "WERKS01",
                "DT_OPERACAO01"
            ],
            "statements": [
                {
                    "type": "unique",
                    "at_least_%": 90.0
                },
                {
                    "type": "not_null",
                    "at_least_%": 95.0
                }
            ]
        }
    ]
}
```

To test your data against this document, import the `deirokay.validate`
method and call it following the example below:
``` python
from deirokay import data_reader, validate

df = data_reader('file.parquet', options='options.json')
validation_result_document = validate(df,
                                      against='assertions.json',
                                      raise_exception=False)
```

The resulting validation document will present the reports for each
statement, as well as its final result: `pass` or `fail`. You may
probably want to save your validation result document by passing a path
to a folder (local or in S3) as `save_to` argument to `validate`. The
results are saved in a subfolder named after your validation document
name, and the current datetime (possibly overridden by `current_date`
argument) is used as the file name.
By default, the validation result document will be saved in the same file
format as the original validation document (you may specify another
format -- either `json` or `yaml` -- in the `save_format` argument).

Here is an example of validation result document:

``` JSON
{
    "name": "VENDAS",
    "description": "An optional field to provide further textual information",
    "items": [
        {
            "scope": [
                "WERKS01",
                "DT_OPERACAO01"
            ],
            "statements": [
                {
                    "type": "unique",
                    "at_least_%": 90.0,
                    "report": {
                        "detail": {
                            "unique_rows": 1500,
                            "unique_rows_%": 99.0
                        },
                        "result": "pass"
                    }
                },
                {
                    "type": "not_null",
                    "at_least_%": 95.0,
                    "report": {
                        "detail": {
                            "null_rows": 0,
                            "null_rows_%": 0.0,
                            "not_null_rows": 1500,
                            "not_null_rows_%": 100.0
                        },
                        "result": "pass"
                    }
                }
            ]
        }
    ]
}
```

These are some of the statement types currently supported by Deirokay:

Statement Type | Available Arguments
---|---
'unique' | 'at_least_%' | Minimum percentage of unique rows
'not_null' | 'at_least_%' | Minimum percentage of non-null rows
'not_null' | 'at_most_%' | Maximum percentage of non-null rows
'not_null' | 'multicolumn_logic' | Whether to use 'any' or 'all' when evaluating non-null rows over multiple colums
'custom' | 'location' | Location of the custom statement (e.g., "/home/file.py::MyStatement")

The following section illustrates how to create and use `custom` type
statements.

### Statement Severity Level

An optional parameter for your statements is the `severity` level,
whose value is typically an integer from 1 to 5 (although nothing
prevents you to use any other integer).
The `deirokay.enums.SeverityLevel` enumeration describes some named
levels: `MINIMAL` (1), `WARNING` (3) and `CRITICAL` (5), but you may
use any other integer in your Validation Documents instead.
When not declared, the default statement severity level is set to 5.

The `severity` parameter is particularly useful in two contexts:
- When using `deirokay.validate` function, you may want to set
`raise_exception` to `True` (default value) and specify a
`exception_level` (default is `CRITICAL` or 5). This function call will
raise a `deirokay.exceptions.ValidationError` exception whenever it
finds a failed statement whose severity level is greater or equal to
the `exception_level` variable.
The `level` attribute of `ValidationError` contains the larger severity
level found in all failed statements.
Any failed statement whose value is lesser than `exception_level` will
only raise a warning.
- When validating data in *Apache Airflow* with *DeirokayOperator*, you
may specify two exception levels: `soft_fail_level` (defaults to
`MINIMAL` or 1) and `hard_fail_level` (defaults to `CRITICAL` or 5).
The task is given the `failed` status if any failed statement severity
matches or exceeds the `hard_fail_level`, and any tasks that may depend
on its success won't be executed.
Otherwise, if the severity level for any failed statement only matches
or exceeds `soft_fail_level`, the task is given the `skipped` status,
which won't prevent the DAG to keep running.
If all failed statements have severity levels lesser than
`soft_fail_level`, only a warning will be logged by the task.

## Creating Custom Statements

Deirokay is designed to be broadly extensible. Even if your set
of rules about your data is not in the bultin set of Deirokay
statements, you can still subclass the `BaseStatement` class and
implement your own statements. If you believe your statement will be
useful for other users, we encourage you to propose it as a
Merge Request so that it can become a builtin statement in a
future release.

A custom Statement should override at least two methods from
`BaseStatement`: `report` and `result`.
The `report` method presents statistics and
measurements that may be used by the `result` method. The report is
attached to the Validation Result Document in order to compose a
detailed list of facts about your data. The `result` method receives
the report generated by `report` and returns either `True` (to signal
success) or `False` (to signal that the statement is invalid).


The code below shows an example of a custom statement:
``` python
from deirokay.statements import BaseStatement


class ThereAreValuesGreaterThanX(BaseStatement):
    # Give your statement class a name (only for completeness,
    # its name is only useful when proposing it in a Merge Request)
    name = 'there_are_values_greater_than_x'
    # Declare which parameters are valid for this statement
    expected_parameters = ['x']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

        # All the arguments necessary for the statement are collected
        # from `self.options`. If they were templated arguments, they
        # should have already been rendered and you may transparently
        # use their final value in `report` and `result` methods.
        self.x = self.options.get('x')

    def report(self, df) -> dict:
        """
            Report statistics and facts about the data.
        """
        bools = df > self.x
        report = {
            'values_greater_than_x': list(bools[bools.all(axis=1)].index)
        }
        return report

    def result(self, report: dict) -> bool:
        """
            Use metrics from the report to indicate either success
            (True) or failure (False)
        """
        return len(report.get('values_greater_than_x')) > 0
```

The following Validation Document shows how to use your custom
Statement for a validation process:
``` JSON
{
        "name": "VENDAS",
        "description": "Validation using custom statement",
        "items": [
            {
                "scope": "NUM_TRANSACAO01",
                "statements": [
                    {
                        "type": "custom",
                        "location": "/home/custom_statement.py::"
                                    "ThereAreValuesGreaterThanX",
                        "x": 2
                    }
                ]
            }
        ]
    }
```

Besides the parameters necessary for your custom statement (`"x": 2` 
in the example above) and the `custom` statement type, you should pass
a `location` parameter that instructs Deirokay how to find your
statement class. There is not need for the module file to be in
current directory: your class will be magically imported by Deirokay
and used during validation process. 

The `location` parameter must follow the pattern 
`path_to_module::class_name`.

Currently, you can pass either a local path or an S3 key:
- `/home/ubuntu/my_module.py::MyStatementClass`
- `s3://my-bucket/my_statements/module_of_statements.py::Stmt`
(make sure you have `boto3` installed)

## Jinja Templating and Statements based on Past Validation Data

Some validation statements may present dynamic behaviour, maybe
folowing a natural uptrend or downtrend movement of your data. Suppose
you expect the number of rows of your data file to possibly fluctuate
+/- 3% around a 7-day moving average. Deirokay allows you to refer to
past validation data by means of a special function called `series`.

To use it, replace a static value for your statement parameter by a
templated argument, such as the following:

``` yaml
name: VENDAS
items:
- scope:
  - WERKS01
  - PROD_VENDA
  #  When the scope has more than one column or has special characters,
  #  you should provide an `alias` string to refer to this item.
  alias: werks_prod
  statements:
  - type: row_count
    min: '{{ 0.97 * (series("VENDAS", 7).werks_prod.row_count.rows.mean()
      |default(19, true)) }}'
    max: '{{ 1.03 * (series("VENDAS", 7).werks_prod.row_count.rows.mean()
      |default(21, true)) }}'
```

The YAML validation document above (could be JSON) presents some new
features. A templated Jinja argument is identified by a pair of double
braces (`{{ }}`) surrounding its content. Deirokay has a special
callable named `series` that you can use to retrieve past data as a
`pandas.Series` object.
When declared, `series` receives a validation document name and a
number of logs to look behind. Next, you should provide a path to a
statement report value, following the sequence:
scope (possibly aliased) => statement type name => statement result
metric name. This returns a `pandas.Series` object you can take any
calculation from (`mean`, `min`, `max`, etc.).

To illustrate this path, take the templated argument from the YAML
document above as an example:

``` python
series("VENDAS", 7).werks_prod.row_count.rows.mean()
```
- `series("VENDAS", 7)`: Retrieve the 7 last validation logs for
"VENDAS";
- `werks_prod`: Consider the `werks_prod`-aliased item (could be the
scope value itself if it is a string for a single column name);
- `row_count`: Use statistics from `row_count` statement;
- `rows`: Use the `rows` metric reported by the selected statement
(you should know the statistics reported by your statement).
- `mean()`: Since the previous object is already a pandas Series, this
call gets its mean value.

When your validation results have no history, this call returns `None`.
In Jinja, you may provide a default value to replace a null variable
by using `|default(<default value>, true)`.

Finally, here is an example of validation result log. Notice that the
templates are already rendered before being saved:

``` yaml
name: VENDAS
items:
- scope:
  - WERKS01
  - PROD_VENDA
  alias: werks_prod
  statements:
  - type: row_count
    min: 1490.0
    max: 1510.0
    report:
      detail:
        rows: 1500
      result: pass
```

## Data Profiling: Auto-generated Validation Document

You may generate a basic validation document by consuming a sample file.
It is recommended that you review the generated document and
supplement it with additional statements. The document can be saved in
either JSON or YAML format, depending on the extension of the file path
that you passed to optional `save_to` argument. You may also retrieve
the `dict` document from the method's return.


``` python
from deirokay import data_reader, profile


df = data_reader(
    'tests/transactions_sample.csv',
    options='tests/options.json'
)

validation_document = profile(df,
                              document_name='my-document-name',
                              save_to='my-validation-document.json')
```

## Deirokay Airflow Operator

Deirokay has its own Airflow Operator, which you can import to your DAG
to validate your data.


``` python
from datetime import datetime

from airflow.models import DAG
from deirokay.airflow import DeirokayOperator


dag = DAG(dag_id='data-validation',
          schedule_interval='@daily',
          default_args={
              'owner': 'airflow',
              'start_date': datetime(2021, 3, 2),
          })

operator = DeirokayOperator(task_id='deirokay-validate',
                            data='tests/transactions_sample.csv',
                            options='tests/options.json',
                            against='tests/assertions.json',
                            dag=dag)
```


