Metadata-Version: 2.1
Name: DataSae
Version: 0.5.1
Summary: Data Quality Framework provides by Jabar Digital Service
Home-page: https://github.com/jabardigitalservice/DataSae
Author: pipinfitriadi@gmail.com, agistyaanugrah@gmail.com, riskaamalia.mail@gmail.com, rizky.prilian@gmail.com, misdan.wijaya07@gmail.com, fauzanmrabbani@gmail.com, hanieffatchudin@gmail.com, adibintangprada@gmail.com, firmanramadan97@gmail.com, raihanjauhari@gmail.com, pujinhidayah@gmail.com
Maintainer: dataengineering@digitalservice.id, jds.dataengineer@gmail.com
License: AGPL-3.0-only
Project-URL: Docs, https://jabardigitalservice.github.io/DataSae/
Project-URL: Repository, https://github.com/jabardigitalservice/DataSae
Keywords: data quality framework,data sae,good data,data bagus,validation,jds,jabar digital service,data engineer
Classifier: Development Status :: 2 - Pre-Alpha
Classifier: Programming Language :: Python :: 3.8
Classifier: License :: OSI Approved :: GNU Affero General Public License v3
Classifier: Operating System :: OS Independent
Classifier: Topic :: Software Development :: Quality Assurance
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas
Requires-Dist: typer[all]
Provides-Extra: converter
Requires-Dist: pyyaml ; extra == 'converter'
Requires-Dist: fastparquet ; extra == 'converter'
Requires-Dist: openpyxl ; extra == 'converter'
Provides-Extra: gsheet
Requires-Dist: google-api-python-client ; extra == 'gsheet'
Requires-Dist: gspread ==5.12.0 ; extra == 'gsheet'
Provides-Extra: s3
Requires-Dist: minio ; extra == 's3'
Provides-Extra: sql
Requires-Dist: sqlalchemy ; extra == 'sql'
Requires-Dist: psycopg2-binary ; extra == 'sql'
Requires-Dist: mysqlclient ; extra == 'sql'
Requires-Dist: pymysql ; extra == 'sql'

<!--
Copyright (C) Free Software Foundation, Inc. All rights reserved.
Licensed under the AGPL-3.0-only License. See LICENSE in the project root
for license information.
-->

# DataSae

[![Docs](https://img.shields.io/badge/Docs-blue)](https://jabardigitalservice.github.io/DataSae/)
[![License](https://img.shields.io/github/license/jabardigitalservice/DataSae?logoColor=black&label=License&labelColor=black&color=brightgreen)](https://github.com/jabardigitalservice/DataSae/blob/main/LICENSE)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/DataSae?logo=python&label=Python&labelColor=black)](https://pypi.org/project/DataSae/)
[![PyPI - Version](https://img.shields.io/pypi/v/DataSae?logo=pypi&label=PyPI&labelColor=black)](https://pypi.org/project/DataSae/)
[![GitHub Action](https://img.shields.io/github/actions/workflow/status/jabardigitalservice/DataSae/python_docker.yaml?logo=GitHub&label=CI/CD&labelColor=black)](https://github.com/jabardigitalservice/DataSae/actions/workflows/python_docker.yaml)
[![Coverage](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/jabardigitalservice/DataSae/python-coverage-comment-action-data/endpoint.json&labelColor=black)](https://htmlpreview.github.io/?https://github.com/jabardigitalservice/DataSae/blob/python-coverage-comment-action-data/htmlcov/index.html)

Data Quality Framework provides by Jabar Digital Service

- [Converter](#converter)
  - [Data Source](#data-source)
    - [Local Computer](#local-computer)
    - [Google Spreadsheet](#google-spreadsheet)
    - [S3](#s3)
    - [SQL](#sql)
      - [MariaDB or MySQL](#mariadb-or-mysql)
      - [PostgreSQL](#postgresql)
  - [Checker for Data Quality](#checker-for-data-quality)
    - [Python Code](#python-code)
    - [Command Line Interface (CLI)](#command-line-interface-cli)

## Converter

[https://github.com/jabardigitalservice/DataSae/blob/46ef80072b98ca949084b4e1ae50bcf23d07d646/tests/data/config.json#L1-L183](https://github.com/jabardigitalservice/DataSae/blob/46ef80072b98ca949084b4e1ae50bcf23d07d646/tests/data/config.json#L1-L183)

[https://github.com/jabardigitalservice/DataSae/blob/46ef80072b98ca949084b4e1ae50bcf23d07d646/tests/data/config.yaml#L1-L120](https://github.com/jabardigitalservice/DataSae/blob/46ef80072b98ca949084b4e1ae50bcf23d07d646/tests/data/config.yaml#L1-L120)

```sh
pip install 'DataSae[converter]'
```

### Data Source

#### Local Computer

```py
from datasae.converter import Config

# From JSON
config = Config('DataSae/tests/data/config.json')

# From YAML
config = Config('DataSae/tests/data/config.yaml')

# Local computer file to DataFrame
local = config('test_local')

df = local('path/file_name.csv', sep=',')
df = local('path/file_name.json')
df = local('path/file_name.parquet')
df = local('path/file_name.xlsx', sheet_name='Sheet1')

df = local('path/file_name.csv')  # Default: sep = ','
df = local('path/file_name.json')
df = local('path/file_name.parquet')
df = local('path/file_name.xlsx')  # Default: sheet_name = 'Sheet1'
```

#### Google Spreadsheet

[https://github.com/jabardigitalservice/DataSae/blob/4308324d066c6627936773ab2d5b990adaa60100/tests/data/creds.json#L1-L12](https://github.com/jabardigitalservice/DataSae/blob/4308324d066c6627936773ab2d5b990adaa60100/tests/data/creds.json#L1-L12)

```sh
pip install 'DataSae[converter,gsheet]'
```

```py
from datasae.converter import Config

# From JSON
config = Config('DataSae/tests/data/config.json')

# From YAML
config = Config('DataSae/tests/data/config.yaml')

# Google Spreadsheet to DataFrame
gsheet = config('test_gsheet')
df = gsheet('Sheet1')
df = gsheet('Sheet1', 'gsheet_id')
```

#### S3

```sh
pip install 'DataSae[converter,s3]'
```

```py
from datasae.converter import Config

# From JSON
config = Config('DataSae/tests/data/config.json')

# From YAML
config = Config('DataSae/tests/data/config.yaml')

# S3 object to DataFrame
s3 = config('test_s3')

df = s3('path/file_name.csv', sep=',')
df = s3('path/file_name.json')
df = s3('path/file_name.parquet')
df = s3('path/file_name.xlsx', sheet_name='Sheet1')

df = s3('path/file_name.csv', 'bucket_name')  # Default: sep = ','
df = s3('path/file_name.json', 'bucket_name')
df = s3('path/file_name.parquet', 'bucket_name')
df = s3('path/file_name.xlsx', 'bucket_name')  # Default: sheet_name = 'Sheet1'
```

#### SQL

```sh
pip install 'DataSae[converter,sql]'
```

> [!IMPORTANT]
> For MacOS users, if [pip install failed](https://stackoverflow.com/questions/67876857/mysqlclient-wont-install-via-pip-on-macbook-pro-m1-with-latest-version-of-big-s) at `mysqlclient`, please run this and retry to install again after that.
>
> ```sh
> brew install mysql
> ```

##### MariaDB or MySQL

```py
from datasae.converter import Config

# From JSON
config = Config('DataSae/tests/data/config.json')

# From YAML
config = Config('DataSae/tests/data/config.yaml')

# MariaDB or MySQL to DataFrame
mariadb_or_mysql = config('test_mariadb_or_mysql')
df = mariadb_or_mysql('select 1 column_name from schema_name.table_name;')
df = mariadb_or_mysql('path/file_name.sql')
```

##### PostgreSQL

```py
from datasae.converter import Config

# From JSON
config = Config('DataSae/tests/data/config.json')

# From YAML
config = Config('DataSae/tests/data/config.yaml')

# PostgreSQL to DataFrame
postgresql = config('test_postgresql')
df = postgresql('select 1 column_name from schema_name.table_name;')
df = postgresql('path/file_name.sql')
```

### Checker for Data Quality

#### Python Code

```py
from datasae.converter import Config

# From JSON
config = Config('DataSae/tests/data/config.json')

# From YAML
config = Config('DataSae/tests/data/config.yaml')

# Check all data qualities on configuration
config.checker  # dict result

# Check data quality by config name
config('test_local').checker  # list of dict result
config('test_gsheet').checker  # list of dict result
config('test_s3').checker  # list of dict result
config('test_mariadb_or_mysql').checker  # list of dict result
config('test_postgresql').checker  # list of dict result
```

Example results:
[https://github.com/jabardigitalservice/DataSae/blob/46ef80072b98ca949084b4e1ae50bcf23d07d646/tests/data/checker.json#L1-L432](https://github.com/jabardigitalservice/DataSae/blob/46ef80072b98ca949084b4e1ae50bcf23d07d646/tests/data/checker.json#L1-L432)

#### Command Line Interface (CLI)

```sh
datasae --help
 
 Usage: datasae [OPTIONS] FILE_PATH
 
 Checker command.
 Creates checker result based on the configuration provided in the checker section of the data source's configuration file.
╭─ Arguments ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ *    file_path      TEXT  The source path of the .json or .yaml file [default: None] [required]                                    │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --config-name                       TEXT  If the config name is not set, it will create all of the checker results [default: None] │
│ --yaml-display    --json-display          [default: yaml-display]                                                                  │
│ --save-to-file-path                 TEXT  [default: None]                                                                          │
│ --help                                    Show this message and exit.                                                              │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
```

Example commands:

```sh
datasae DataSae/tests/data/config.yaml # Check all data qualities on configuration
datasae DataSae/tests/data/config.yaml --config-name test_local # Check data quality by config name
```

> [!TIP]
> Actually, we have example for DataSae implementation in Apache Airflow, but for now it is for private use only. Internal developer can see it at this [git repository](https://gitlab.com/jdsteam/core-data-platform/data-products/example-datasae-airflow).
