Metadata-Version: 2.4
Name: duckguard
Version: 2.0.0
Summary: A Python-native data quality tool with AI superpowers, built on DuckDB for speed
Project-URL: Homepage, https://github.com/duckguard/duckguard
Project-URL: Documentation, https://duckguard.dev
Project-URL: Repository, https://github.com/duckguard/duckguard
Author: DuckGuard Team
License-Expression: Elastic-2.0
License-File: LICENSE
Keywords: data-engineering,data-quality,data-validation,duckdb,testing
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: Other/Proprietary License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Database
Classifier: Topic :: Software Development :: Testing
Requires-Python: >=3.10
Requires-Dist: duckdb>=1.0.0
Requires-Dist: packaging>=21.0
Requires-Dist: pyarrow>=14.0.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: pyyaml>=6.0.0
Requires-Dist: rich>=13.0.0
Requires-Dist: typer>=0.9.0
Provides-Extra: all
Requires-Dist: anthropic>=0.18.0; extra == 'all'
Requires-Dist: databricks-sql-connector>=2.0.0; extra == 'all'
Requires-Dist: google-cloud-bigquery>=3.0.0; extra == 'all'
Requires-Dist: kafka-python>=2.0.0; extra == 'all'
Requires-Dist: openai>=1.0.0; extra == 'all'
Requires-Dist: oracledb>=1.0.0; extra == 'all'
Requires-Dist: psycopg2-binary>=2.9.0; extra == 'all'
Requires-Dist: pymongo>=4.0.0; extra == 'all'
Requires-Dist: pymysql>=1.0.0; extra == 'all'
Requires-Dist: pyodbc>=4.0.0; extra == 'all'
Requires-Dist: redshift-connector>=2.0.0; extra == 'all'
Requires-Dist: snowflake-connector-python>=3.0.0; extra == 'all'
Provides-Extra: bigquery
Requires-Dist: google-cloud-bigquery>=3.0.0; extra == 'bigquery'
Provides-Extra: databases
Requires-Dist: databricks-sql-connector>=2.0.0; extra == 'databases'
Requires-Dist: google-cloud-bigquery>=3.0.0; extra == 'databases'
Requires-Dist: kafka-python>=2.0.0; extra == 'databases'
Requires-Dist: oracledb>=1.0.0; extra == 'databases'
Requires-Dist: psycopg2-binary>=2.9.0; extra == 'databases'
Requires-Dist: pymongo>=4.0.0; extra == 'databases'
Requires-Dist: pymysql>=1.0.0; extra == 'databases'
Requires-Dist: pyodbc>=4.0.0; extra == 'databases'
Requires-Dist: redshift-connector>=2.0.0; extra == 'databases'
Requires-Dist: snowflake-connector-python>=3.0.0; extra == 'databases'
Provides-Extra: databricks
Requires-Dist: databricks-sql-connector>=2.0.0; extra == 'databricks'
Provides-Extra: dev
Requires-Dist: black>=23.0.0; extra == 'dev'
Requires-Dist: mypy>=1.0.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.0.0; extra == 'dev'
Requires-Dist: pytest>=7.0.0; extra == 'dev'
Requires-Dist: ruff>=0.1.0; extra == 'dev'
Provides-Extra: kafka
Requires-Dist: kafka-python>=2.0.0; extra == 'kafka'
Provides-Extra: llm
Requires-Dist: anthropic>=0.18.0; extra == 'llm'
Requires-Dist: openai>=1.0.0; extra == 'llm'
Provides-Extra: mongodb
Requires-Dist: pymongo>=4.0.0; extra == 'mongodb'
Provides-Extra: mysql
Requires-Dist: pymysql>=1.0.0; extra == 'mysql'
Provides-Extra: oracle
Requires-Dist: oracledb>=1.0.0; extra == 'oracle'
Provides-Extra: postgres
Requires-Dist: psycopg2-binary>=2.9.0; extra == 'postgres'
Provides-Extra: redshift
Requires-Dist: redshift-connector>=2.0.0; extra == 'redshift'
Provides-Extra: snowflake
Requires-Dist: snowflake-connector-python>=3.0.0; extra == 'snowflake'
Provides-Extra: sqlserver
Requires-Dist: pyodbc>=4.0.0; extra == 'sqlserver'
Description-Content-Type: text/markdown

# DuckGuard

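Data quality that just works: Python-native, DuckDB-powered, and roughly 10x faster than a Pandas/Great Expectations pipeline on the 1GB CSV benchmark in the Performance section.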

[![PyPI version](https://badge.fury.io/py/duckguard.svg)](https://badge.fury.io/py/duckguard)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![License: Elastic-2.0](https://img.shields.io/badge/License-Elastic--2.0-blue.svg)](https://www.elastic.co/licensing/elastic-license)

```bash
pip install duckguard
```

## 60-Second Demo

```bash
# CLI - instant data quality check
duckguard check data.csv

# Auto-generate validation rules
duckguard discover data.csv --output duckguard.yaml
```

```python
# Python - feels like pytest
from duckguard import connect

orders = connect("data/orders.csv")

assert orders.row_count > 0
assert orders.customer_id.null_percent < 5
assert orders.amount.between(0, 10000)
assert orders.status.isin(['pending', 'shipped', 'delivered'])
```

## Key Features

| Feature | Description |
|---------|-------------|
| **Quality Scoring** | Get A-F grades for your data |
| **YAML Rules** | Define checks in simple YAML files |
| **Semantic Detection** | Auto-detect emails, phones, SSNs, PII |
| **Data Contracts** | Schema + SLAs with breaking change detection |
| **Anomaly Detection** | Z-score, IQR, and percent change methods |
| **pytest Integration** | Data tests alongside unit tests |

## Quick Examples

### Quality Score
```python
quality = orders.score()
print(f"Grade: {quality.grade}")  # A, B, C, D, or F
```
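
For intuition, a letter grade can be read as a bucketing of a numeric score. The sketch below is not DuckGuard's scoring algorithm: the `grade_for` helper and the 90/80/70/60 cut-offs are assumptions chosen purely for illustration.

```python
def grade_for(score: float) -> str:
    """Map a 0-100 score to a letter grade using hypothetical cut-offs."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    # Walk the cut-offs from highest to lowest and return the first match.
    for cutoff, letter in ((90, "A"), (80, "B"), (70, "C"), (60, "D")):
        if score >= cutoff:
            return letter
    return "F"


for example in (93.5, 61, 12):
    print(example, grade_for(example))
```

Refer to the value returned by `orders.score()` for the actual grade; the thresholds above only show the general shape of such a mapping.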

### YAML Rules
```yaml
# duckguard.yaml
dataset: orders
rules:
  - order_id is not null
  - order_id is unique
  - amount >= 0
  - status in ['pending', 'shipped', 'delivered']
```

```python
from duckguard import load_rules, execute_rules
result = execute_rules(load_rules("duckguard.yaml"), dataset=orders)
```

### PII Detection
```python
from duckguard.semantic import SemanticAnalyzer
analysis = SemanticAnalyzer().analyze(orders)
print(f"PII found: {analysis.pii_columns}")
```
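
As a rough illustration of pattern-based semantic detection, the sketch below classifies a column by checking what share of its values match a regular expression. It is not how `SemanticAnalyzer` works internally; the `PATTERNS` table, the `guess_semantic_type` helper, and the 80% threshold are all assumptions made for this example.

```python
import re

# Hypothetical patterns. Real detectors usually combine value patterns with
# column-name hints and accept many more formats than these.
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "phone": re.compile(r"^\+?\d[\d\s().-]{6,}\d$"),
}


def guess_semantic_type(values, threshold=0.8):
    """Return the first pattern name matched by at least `threshold` of the non-null values."""
    non_null = [str(v) for v in values if v is not None]
    if not non_null:
        return None
    for name, pattern in PATTERNS.items():
        hits = sum(1 for v in non_null if pattern.match(v))
        if hits / len(non_null) >= threshold:
            return name
    return None


sample = {
    "contact": ["ann@example.com", "bob@example.org", None],
    "tax_id": ["123-45-6789", "987-65-4321"],
    "amount": ["19.99", "5.00"],
}
detected = {column: guess_semantic_type(values) for column, values in sample.items()}
print(detected)
```

Pattern order matters in this sketch: an SSN-shaped value also satisfies the looser phone pattern, so the stricter pattern is checked first.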

### Anomaly Detection
```python
from duckguard import detect_anomalies
report = detect_anomalies(orders, method="zscore")
```
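
To make the named methods concrete, here is a standard-library sketch of z-score, IQR, and percent-change checks. These are the textbook definitions, not DuckGuard's implementation; the function names, default thresholds, and sample data are assumptions for illustration.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Values lying more than `threshold` population standard deviations from the mean."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values identical, nothing can be an outlier
    return [v for v in values if abs(v - mean) / stdev > threshold]


def iqr_outliers(values, k=1.5):
    """Values outside the fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def percent_change(previous, current):
    """Relative change from `previous` to `current`, in percent."""
    if previous == 0:
        raise ValueError("percent change is undefined when the previous value is 0")
    return (current - previous) / abs(previous) * 100


amounts = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 500]
print(zscore_outliers(amounts))
print(iqr_outliers(amounts))
print(percent_change(100, 130))
```

The z-score method assumes roughly normal data and is itself distorted by extreme values, while the IQR fences are more robust to skew. Percent change fits best when comparing a metric between two runs.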

### Data Contracts
```python
from duckguard import generate_contract, validate_contract
contract = generate_contract(orders)
# new_orders: a later load of the same dataset, opened with connect()
result = validate_contract(contract, new_orders)
```
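
Breaking change detection can be pictured as a diff between two schemas. The sketch below is a simplified assumption, not the logic behind `validate_contract`: the `breaking_changes` helper, the dict-based schema representation, and the rule that only removed columns and changed types count as breaking are all hypothetical.

```python
def breaking_changes(old_schema: dict[str, str], new_schema: dict[str, str]) -> list[str]:
    """List removed columns and changed types; added columns are treated as non-breaking."""
    problems = []
    for column, old_type in old_schema.items():
        if column not in new_schema:
            problems.append(f"column removed: {column}")
        elif new_schema[column] != old_type:
            problems.append(f"type changed: {column} {old_type} -> {new_schema[column]}")
    return problems


old = {"order_id": "BIGINT", "amount": "DOUBLE", "status": "VARCHAR"}
new = {"order_id": "VARCHAR", "amount": "DOUBLE", "created_at": "TIMESTAMP"}
changes = breaking_changes(old, new)
for change in changes:
    print(change)
```

In this sketch the new `created_at` column is ignored because consumers of the old schema are unaffected by it, whereas the retyped `order_id` and the dropped `status` column are both reported.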

## Supported Sources

| Type | Sources |
|------|---------|
| **Files** | CSV, Parquet, JSON, Excel |
| **Cloud** | S3, GCS, Azure Blob |
| **Databases** | PostgreSQL, MySQL, SQLite, Snowflake, BigQuery, Redshift, Databricks, SQL Server, Oracle, MongoDB |
| **Formats** | Delta Lake, Apache Iceberg |

```python
from duckguard import connect

# Connect to anything
orders = connect("s3://bucket/orders.parquet")
orders = connect("postgres://localhost/db", table="orders")
orders = connect("snowflake://account/db", table="orders")
```

## CLI Commands

```bash
duckguard check <file>       # Run quality checks
duckguard discover <file>    # Auto-generate rules
duckguard contract generate  # Create data contract
duckguard contract validate  # Validate against contract
duckguard anomaly <file>     # Detect anomalies
```

## Column Methods

```python
# `col` stands for any dataset column, e.g. orders.amount

# Statistics
col.null_percent, col.unique_percent
col.min, col.max, col.mean, col.stddev

# Validations
col.between(0, 100)
col.matches(r'^\d{5}$')
col.isin(['a', 'b', 'c'])
col.has_no_duplicates()
```

## Performance

Built on DuckDB for speed:

| Workload | Pandas / Great Expectations | DuckGuard |
|---|---|---|
| 1GB CSV | 45s, 4GB RAM | 4s, 200MB RAM |

## License

Elastic License 2.0 - see [LICENSE](LICENSE)
