Metadata-Version: 2.1
Name: data-raven
Version: 1.0.0
Summary: A Python framework for building data quality tests
Home-page: https://github.com/tekkabroley/data-raven
Author: Alex Broley
Author-email: alex.broley@gmail.com
License: UNKNOWN
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.6
Description-Content-Type: text/markdown

# data-raven

## Description
A toolbox of flexible database connectors and test methods used to measure data integrity of datasets and database 
tables.
* Build data quality tests which can be inserted into an existing Python script or run as a stand-alone script. 
* Send outcome notifications to messaging and logging applications. 
* Halt pipelines and raise exceptions when needed.

## Prerequisites
Python 3.6+

sqlalchemy>=1.3.19

psycopg2

pymysql

## Installing
`pip install data-raven`

## A simple data quality test script

In this example we build a script to test the `name`, `price` and `product_id` columns from the Postgres table `Orders`.
This table has the following DDL:
```buildoutcfg
create table Orders (
id int,
name varchar(50),
order_ts varchar(26),
product_id int,
price float
);
```

Here's the test script.
```buildoutcfg
import os

from dataraven.connections import PostgresConnector
from dataraven.data_quality_operators import SQLNullCheckOperator


def main():
    # initialize logging
    lazy_logger = lambda msg: print(msg + '\n')

    # database connection credentials
    user = os.environ["user"]
    password = os.environ["password"]
    host = os.environ["host"]
    dbname = os.environ["dbname"]
    port = os.environ["port"]

    # postgres database connector
    conn = PostgresConnector(user, password, host, dbname, port, logger=lazy_logger)
    dialect = "postgres"

    # test thresholds
    threshold0 = 0
    threshold1 = 0.01
    threshold5 = 0.05

    ##### TEST ORDERS TABLE #####
    # Table to be tested
    from_clause = "test_schema.Orders"

    # Conditional logic to be applied to input data
    date = "2020-09-08"
    where_clause = [f"date(order_ts) = '{date}'"]

    # Columns to be tested in target table
    columns = ("name", "product_id", "price")

    # Threshold value to be applied to each column
    threhold = {"name": threshold1, "product_id": threshold0, "price": threshold5}

    # Hard fail condition set on specific columns
    hard_fail = {"product_id": True}

    # Execute the null check test on each column in columns, on the above table
    SQLNullCheckOperator(conn, dialect, from_clause, threhold, *columns, where=where_clause, logger=lazy_logger,
                         hard_fail=hard_fail)


if __name__ == "__main__":
    main()
```

# Documentation
## Database Support
* Postgres
* MySQL

## Data Quality Tests
Data quality tests are used to measure the integrity of specified columns within a table or document. Every data 
quality test will return `'test_pass'` or `'test_fail'` depending on the given measure and threshold.

### Data Quality Operators
Each operator will log the test results using the function passed in the `logger` parameter. If no logger is found then
these log messages will be swallowed. 

Each operator has a `test_results` attribute which exposes the results from the underlying test. `test_results` is a 
`dict` object with the following structure:
```buildoutcfg
{
    COLUMN NAME: {
        "result": 'test_pass' or 'test_fail',
        "measure": THE MEASURED VALUE OF COLUMN NAME,
        "threshold": THE THRESHOLD VALUE SPECIFIED FOR TEST,
        "result_msg": TEST RESULT MESSAGE
    }
}   
```

#### SQL Operators
All SQL operators have the following required parameters:
* `conn` - The database connection object.
* `dialect` - The SQL dialect for the given database. Accepted values are `postgres` or `mysql`.
* `from_` - The schema and table name of table to be tested.
* `threshold` - The threshold specified for a given test or collection of tests. This parameter can be numeric or a 
`dict` object. If `threshold` is numeric then this value will be applied to all columns being tested by the operator. 
If `threshold` is a `dict` then each `threshold` value will be referenced by column name. All columns being passed to the
operator must have a specified threshold value. If `threshold` is a `dict` it must have the following structure:
```buildoutcfg
{
    COLUMN NAME: NUMERIC VALUE
}
```
* `columns` - The column names entered as comma separated positional arguments.

All SQL operators have the following optional parameters:
* `logger` - The logging function. If None is passed then logged messages will be swallowed.
* `where` - Conditional logic to be applied to table specified in `from_`.
* `hard_fail` - Specifies if an operator which has a test which results in `'test_fail'` should terminate the current 
process. This parameter
can be passed as a literal or a `dict` object. If `hard_fail` is set to `True` then every test being performed by the 
given operator which results in `'test_fail'` will terminate the current process. If `hard_fail` is a `dict` object then
each `hard_fail` value will be referenced by column name. Only those columns with a `hard_fail` value of `True` will 
terminate the process upon test failure. If `hard_fail` is a `dict` it must have the following structure:
```buildoutcfg
{
    COLUMN NAME: BOOLEAN VALUE
}
```
* `use_ansi` - If true then compile measure query to ANSI standards.

`SQLNullCheckOperator` - Test the proportion of null values for each column contained in `columns`. 

`SQLDuplicateCheckOperator` - Test the proportion of duplicate values for each column contained in `columns`.

`SQLSetDuplicateCheckOperator` - Test the number of duplicate values across all columns passed to the `columns`
parameter simultaniously. This measure is equivalent to counting the number of rows returned from a `SELECT DISTINCT` on 
all columns and dividing by the total number of rows.

#### CSV Operators
All CSV operators have the following required parameters:
* `from_` - The path to CSV file to be tested.
* `threshold` - Same as defined above for SQL operators.
* `columns` - the column names entered as comma separated positional arguments.

All CSV operators have the following optional parameters:
* `delimiter` -  The delimiter used to separate values specified in the file refeneced by the `from_` parameter. 
* `hard_fail` - Same as defined above for SQL operators.
* `fieldnames` - A sequence of all column names for CSV file specified in `from_` parameter. To be used if the specified
file does not have column headers.
* `reducer_kwargs` - Key word arguments passed to the measure reducer function.

`CSVNullCheckOperator` - Test the proportion of `NULL` values for each column contained in `columns`.

`CSVDuplicateCheckOperator` - Test the proportion of duplicate values for each column contained in `columns`.

`CSVSetDuplicateCheckOperator` - Test the number of duplicate values across all columns passed to the `columns`
parameter simultaniously.

#### Custom Operators
`CustomSQLDQOperator` - Executes the test passed by the `custom_test` parameter on each column contained in `columns`. 
The `CustomSQLDQOperator` class has the following required parameters:
* `conn` - The database connection object.
* `custom_test` - The SQL query to be executed. The `custom_test` query is required to return a column labeled `result` 
which takes value `'test_pass'` or `'test_fail'`. The `custom_test` query should also return columns `measure`, which
provides the measured column value, and `threshold`, which gives the threshold used in the test. If these columns are
present then these values will be logged and returned in the `test_results` attribute. If `measure` and `threshold` are
not returned by the `custom_test` query then these values will be logged as `None`, and will be given in the 
`test_results` attribute as `None`. `custom_test` can also be a query template with placeholders `{column}` and 
`{threshold}` for variable column names and threshold values.  
* `description` - The description of the data quality test being performed. The description is may contain 
placeholders `{column}` and `{threshold}` for the optional parameters `columns` and `threshold`, if they are passed
to the `CustomSQLDQOperator`. In this case then a test description will be generated for each `column` in `columns` and
for each value of `threshold`.

The `CustomSQLDQOperator` class has the following optional parameters:
* `columns` - a comma separated list of column arguments.
* `threhsold` - Same as defined above for SQL operators.
* `hard_fail` - Same as defined above for SQL operators.
* `test_desc_kwargs` - Key word arguments for formatting the test description.



