Metadata-Version: 2.4
Name: truthound
Version: 3.0.0
Summary: Zero-Configuration Data Quality Framework Powered by Polars
Project-URL: Homepage, https://github.com/seadonggyun4/Truthound
Project-URL: Repository, https://github.com/seadonggyun4/Truthound
Project-URL: Issues, https://github.com/seadonggyun4/Truthound/issues
Author-email: seadonggyun4 <seadonggyun4@gmail.com>
License-Expression: Apache-2.0
License-File: LICENSE
Keywords: data-masking,data-quality,data-validation,pii-detection,polars
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Typing :: Typed
Requires-Python: >=3.11
Requires-Dist: polars>=1.0.0
Requires-Dist: pyyaml>=6.0.0
Requires-Dist: rich>=13.0.0
Requires-Dist: typer>=0.12.0
Provides-Extra: all
Requires-Dist: aiokafka>=0.9.0; extra == 'all'
Requires-Dist: azure-storage-blob>=12.0.0; extra == 'all'
Requires-Dist: boto3>=1.26.0; extra == 'all'
Requires-Dist: duckdb<2.0.0,>=1.0.0; extra == 'all'
Requires-Dist: elasticsearch[async]>=8.0.0; extra == 'all'
Requires-Dist: google-cloud-storage>=2.0.0; extra == 'all'
Requires-Dist: great-expectations<2.0.0,>=1.15.0; extra == 'all'
Requires-Dist: jinja2>=3.0.0; extra == 'all'
Requires-Dist: motor>=3.0.0; extra == 'all'
Requires-Dist: pandas<3.0.0,>=2.0.0; extra == 'all'
Requires-Dist: psutil<8.0.0,>=5.9.0; extra == 'all'
Requires-Dist: reflex>=0.4.0; extra == 'all'
Requires-Dist: scikit-learn>=1.3.0; extra == 'all'
Requires-Dist: scipy>=1.10.0; extra == 'all'
Requires-Dist: sqlalchemy<3.0.0,>=2.0.0; extra == 'all'
Requires-Dist: sqlalchemy>=2.0.0; extra == 'all'
Requires-Dist: weasyprint>=60.0; extra == 'all'
Requires-Dist: xxhash>=3.4.0; extra == 'all'
Provides-Extra: anomaly
Requires-Dist: scikit-learn>=1.3.0; extra == 'anomaly'
Requires-Dist: scipy>=1.10.0; extra == 'anomaly'
Provides-Extra: async-datasources
Requires-Dist: aiokafka>=0.9.0; extra == 'async-datasources'
Requires-Dist: elasticsearch[async]>=8.0.0; extra == 'async-datasources'
Requires-Dist: motor>=3.0.0; extra == 'async-datasources'
Provides-Extra: azure
Requires-Dist: azure-storage-blob>=12.0.0; extra == 'azure'
Provides-Extra: benchmarks
Requires-Dist: duckdb<2.0.0,>=1.0.0; extra == 'benchmarks'
Requires-Dist: great-expectations<2.0.0,>=1.15.0; extra == 'benchmarks'
Requires-Dist: pandas<3.0.0,>=2.0.0; extra == 'benchmarks'
Requires-Dist: psutil<8.0.0,>=5.9.0; extra == 'benchmarks'
Requires-Dist: sqlalchemy<3.0.0,>=2.0.0; extra == 'benchmarks'
Provides-Extra: dashboard
Requires-Dist: reflex>=0.4.0; extra == 'dashboard'
Provides-Extra: database
Requires-Dist: sqlalchemy>=2.0.0; extra == 'database'
Provides-Extra: dev
Requires-Dist: mypy>=1.10.0; extra == 'dev'
Requires-Dist: pandas<3.0.0,>=2.0.0; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.23.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.0.0; extra == 'dev'
Requires-Dist: pytest>=8.0.0; extra == 'dev'
Requires-Dist: ruff>=0.4.0; extra == 'dev'
Requires-Dist: scikit-learn>=1.3.0; extra == 'dev'
Requires-Dist: scipy>=1.10.0; extra == 'dev'
Provides-Extra: docs
Requires-Dist: mkdocs-material>=9.5.0; extra == 'docs'
Requires-Dist: mkdocs>=1.6.0; extra == 'docs'
Requires-Dist: mkdocstrings[python]>=0.24.0; extra == 'docs'
Requires-Dist: pymdown-extensions>=10.8.0; extra == 'docs'
Provides-Extra: drift
Requires-Dist: scipy>=1.10.0; extra == 'drift'
Provides-Extra: duckdb
Requires-Dist: duckdb>=1.0.0; extra == 'duckdb'
Provides-Extra: elasticsearch
Requires-Dist: elasticsearch[async]>=8.0.0; extra == 'elasticsearch'
Provides-Extra: gcs
Requires-Dist: google-cloud-storage>=2.0.0; extra == 'gcs'
Provides-Extra: kafka
Requires-Dist: aiokafka>=0.9.0; extra == 'kafka'
Provides-Extra: mongodb
Requires-Dist: motor>=3.0.0; extra == 'mongodb'
Provides-Extra: nosql
Requires-Dist: elasticsearch[async]>=8.0.0; extra == 'nosql'
Requires-Dist: motor>=3.0.0; extra == 'nosql'
Provides-Extra: pdf
Requires-Dist: weasyprint>=60.0; extra == 'pdf'
Provides-Extra: perf
Requires-Dist: xxhash>=3.4.0; extra == 'perf'
Provides-Extra: reports
Requires-Dist: jinja2>=3.0.0; extra == 'reports'
Provides-Extra: s3
Requires-Dist: boto3>=1.26.0; extra == 's3'
Provides-Extra: stores
Requires-Dist: azure-storage-blob>=12.0.0; extra == 'stores'
Requires-Dist: boto3>=1.26.0; extra == 'stores'
Requires-Dist: google-cloud-storage>=2.0.0; extra == 'stores'
Requires-Dist: sqlalchemy>=2.0.0; extra == 'stores'
Provides-Extra: streaming
Requires-Dist: aiokafka>=0.9.0; extra == 'streaming'
Description-Content-Type: text/markdown

<div align="center">
  <img width="500" alt="Truthound Banner" src="docs/assets/truthound_banner.png" />
</div>

<h1 align="center">Truthound</h1>

<p align="center">
  <strong>Zero-Configuration Data Quality Framework Powered by Polars</strong>
</p>

<p align="center">
  <em>Sniffs out bad data.</em>
</p>

<p align="center">
  <a href="https://truthound.netlify.app/"><img src="https://img.shields.io/badge/docs-truthound.netlify.app-blue" alt="Documentation"></a>
  <a href="https://pypi.org/project/truthound/"><img src="https://img.shields.io/pypi/v/truthound.svg" alt="PyPI"></a>
  <a href="https://www.python.org/downloads/"><img src="https://img.shields.io/badge/python-3.11+-blue.svg" alt="Python"></a>
  <a href="https://opensource.org/licenses/Apache-2.0"><img src="https://img.shields.io/badge/License-Apache%202.0-orange.svg" alt="License"></a>
  <a href="https://pola.rs/"><img src="https://img.shields.io/badge/Powered%20by-Polars-2563EB?logo=polars&logoColor=white" alt="Powered by Polars"></a>
  <a href="https://github.com/ddotta/awesome-polars"><img src="https://awesome.re/badge.svg" alt="Awesome Polars"></a>
  <a href="https://pepy.tech/project/truthound"><img src="https://static.pepy.tech/badge/truthound?color=green" alt="Downloads"></a>
</p>

> Truthound 3.0 turns the familiar `th.check()`, `th.scan()`, `th.mask()`, `th.profile()`, and `th.learn()` facade into a native zero-configuration validation platform built around `TruthoundContext`, `ValidationRunResult`, deterministic auto-suites, and a Polars-first planning/runtime kernel.

---

## Abstract

<p align="center">
  <img width="200" alt="Truthound Icon" src="docs/assets/Truthound_icon_banner.png" />
</p>

Truthound is a Polars-first data validation framework for modern data engineering systems. Version 3.0 keeps the easy first-run experience and makes the runtime architecture explicit: a zero-config project context, a deterministic auto-suite builder, backend-aware planning, exact-by-default execution, a single canonical `ValidationRunResult`, and one plugin/reporting surface shared across checkpoints, docs, and automation.

**Documentation**: [truthound.netlify.app](https://truthound.netlify.app/)

<!--
Temporary comment-out: keep the related-projects section hidden until these
projects are sufficiently mature for the public README again.

**Related Projects**

| Project | Description | Status |
| --- | --- | --- |
| [truthound-orchestration](https://github.com/seadonggyun4/truthound-orchestration) | Workflow integration for Airflow, Dagster, Prefect, and dbt | Alpha |
| [truthound-dashboard](https://github.com/seadonggyun4/truthound-dashboard) | Web-based data quality monitoring dashboard | Alpha |
-->

## Why Truthound

- Polars-first execution and planner-driven aggregation instead of repeated validator-side scans
- Zero-configuration by default: `th.check(data)` creates and reuses a local `.truthound/` workspace automatically
- Deterministic auto-suite selection that starts with schema/nullability/type/range/key heuristics instead of "run everything"
- Canonical `ValidationRunResult` shared by checkpoints, reporters, validation docs, and plugins
- Explicit contracts for contexts, check factories, backends, and artifact generation
- Failure-first test lanes and migration diagnostics that make framework upgrades safer in production
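
To make the deterministic auto-suite idea concrete, here is a minimal pure-Python sketch of heuristic check selection. It is not Truthound's `AutoSuiteBuilder`: `ToyCheck`, `build_auto_suite`, and the specific heuristics (including the "`id`-like name means key" rule) are assumptions made for illustration. The point it shows is that the same input always yields the same small suite, rather than every available validator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyCheck:
    """Hypothetical stand-in for a check spec; not a Truthound type."""
    kind: str
    column: str
    detail: str = ""


def build_auto_suite(data: dict[str, list]) -> list[ToyCheck]:
    """Derive a small, repeatable set of checks from column contents."""
    checks: list[ToyCheck] = []
    for column in sorted(data):  # sorted so identical input gives an identical suite
        values = data[column]
        present = [v for v in values if v is not None]
        kinds = sorted({type(v).__name__ for v in present})

        # Type heuristic: only assert a type when the column is homogeneous.
        if len(kinds) == 1:
            checks.append(ToyCheck("type", column, kinds[0]))
        # Nullability heuristic: a column without nulls is expected to stay that way.
        if len(present) == len(values):
            checks.append(ToyCheck("not_null", column))
        # Range heuristic: numeric columns get their observed min/max as bounds.
        if present and all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
        ):
            checks.append(ToyCheck("range", column, f"{min(present)}..{max(present)}"))
        # Key heuristic (assumed): id-like column names are treated as key candidates.
        if column == "id" or column.endswith("_id"):
            checks.append(ToyCheck("unique", column))
    return checks


suite = build_auto_suite(
    {"customer_id": [1, 2, 2], "email": ["a@example.com", None, "c@example.com"]}
)
for check in suite:
    print(check)
```

In this toy version the `unique` check on `customer_id` is selected from the column name alone; whether it passes is decided later, at execution time.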

## What Changed in 3.0

Truthound 3.0 resets the public contract around a smaller and more durable kernel:

| Layer | Responsibility |
| --- | --- |
| `TruthoundContext` | Auto-discovered project workspace, baselines, run history, docs artifacts, plugin runtime, and resolved defaults |
| `contracts` | Stable ports such as `DataAsset`, `ExecutionBackend`, `MetricRepository`, `ArtifactStore`, and plugin capabilities |
| `suite` | Immutable validation intent via `ValidationSuite`, `CheckSpec`, `SchemaSpec`, evidence policy, and severity policy |
| `planning` | Scan planning, backend routing, metric deduplication, and pushdown eligibility |
| `runtime` | Session lifecycle, retries, timeout-safe execution, exception isolation, and evidence capture |
| `results` | `CheckResult`, `ValidationRunResult`, and `ExecutionIssue` as the canonical output model |

The design is grounded in proven ideas from Great Expectations, Soda, Deequ, and Pandera, but optimized for a simpler zero-config starting point and a Polars-first execution path.
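
The `planning` row mentions metric deduplication. As a rough illustration of that idea only, the sketch below collapses overlapping metric requests from several checks into one plan, so each metric is computed once. The check names, `plan_metrics`, `execute`, and the metric functions are all hypothetical; Truthound's planner works on Polars and is not shown here.

```python
# Each hypothetical check declares the (column, metric) pairs it needs.
CHECK_REQUIREMENTS = {
    "email_not_null": [("email", "null_count")],
    "email_completeness": [("email", "null_count"), ("email", "row_count")],
    "customer_id_unique": [("customer_id", "n_unique"), ("customer_id", "row_count")],
}

# Plain-Python metric implementations standing in for backend aggregations.
METRIC_FUNCS = {
    "null_count": lambda values: sum(v is None for v in values),
    "row_count": len,
    "n_unique": lambda values: len(set(values)),
}


def plan_metrics(requirements):
    """Collapse per-check requirements into one ordered, duplicate-free plan."""
    return sorted({pair for pairs in requirements.values() for pair in pairs})


def execute(plan, data):
    """Compute every planned metric exactly once."""
    return {(col, metric): METRIC_FUNCS[metric](data[col]) for col, metric in plan}


data = {"customer_id": [1, 2, 2], "email": ["a@example.com", None, "c@example.com"]}
plan = plan_metrics(CHECK_REQUIREMENTS)
metrics = execute(plan, data)

requested = sum(len(pairs) for pairs in CHECK_REQUIREMENTS.values())
print(f"{requested} metric requests collapsed into {len(plan)} computations")
print(metrics[("email", "null_count")], metrics[("customer_id", "n_unique")])
```

Checks would then read their inputs from the shared `metrics` mapping instead of scanning the data themselves.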

The practical 3.0 changes are:

- `th.check()` returns `ValidationRunResult` directly
- the local `.truthound/` workspace is auto-created and reused
- `validators=None` now means deterministic `AutoSuiteBuilder`, not "run every built-in validator"
- `compare` moved to `truthound.drift.compare`
- checkpoints standardize on `CheckpointResult.validation_run` and `CheckpointResult.validation_view`
- reporters and validation docs consume `ValidationRunResult` directly through reporter contract v3

## Quick Start

### Installation

```bash
pip install truthound
```

```bash
# Development and docs workflows in this repository
uv sync --extra dev --extra docs
```

### Python API

```python
import truthound as th
from truthound.datadocs import generate_validation_report
from truthound.reporters import get_reporter
from truthound.drift import compare

# Validate in-memory data. With no validators given, a deterministic auto-suite is built.
run = th.check(
    {"customer_id": [1, 2, 2], "email": ["a@example.com", None, "c@example.com"]},
)

# `run` is a ValidationRunResult.
print(run.execution_mode)
print([check.name for check in run.checks])
print(run.metadata["context_root"])

# Reporters and validation docs consume the ValidationRunResult directly.
json_report = get_reporter("json").render(run)
validation_docs = generate_validation_report(run, title="Customer Quality Overview")

# The rest of the facade: project context, schema learning, masking, and drift comparison.
context = th.get_context()
schema = th.learn({"id": [1, 2], "status": ["active", "inactive"]})
masked = th.mask(
    {"email": ["a@example.com", "b@example.com"]},
    columns=["email"],
    strategy="hash",
)
drift = compare({"score": [0.1, 0.2]}, {"score": [0.1, 0.8]})
```

### CLI

```bash
truthound check data.csv --validators null,unique
truthound check --connection "sqlite:///warehouse.db" --table users --pushdown
truthound scan pii.csv
truthound profile data.csv
truthound doctor . --migrate-2to3
truthound plugins list --json
```

## Public Surface

The root package intentionally exports a smaller API:

- Stable facade: `check`, `scan`, `mask`, `profile`, `learn`, `read`, `get_context`
- Core types: `TruthoundContext`, `ValidationSuite`, `CheckSpec`, `SchemaSpec`, `ValidationRunResult`, `CheckResult`
- `th.check()` returns `ValidationRunResult` directly
- Checkpoint runtime results: `CheckpointResult.validation_run` is canonical and `CheckpointResult.validation_view` is the compatibility projection for legacy action formatting
- Reporter-facing types: `truthound.reporters.RunPresentation`, `truthound.reporters.ReporterContext`
- Validation docs entry points: `truthound.datadocs.ValidationDocsBuilder`, `truthound.datadocs.generate_validation_report`
- Drift comparison: import from `truthound.drift.compare`
- Advanced systems: import by namespace, for example `truthound.ml`, `truthound.lineage`, `truthound.realtime`, or `truthound.datadocs`

The experimental `use_engine` and `--use-engine` switches were removed and have not been reinstated in 3.0.

## Zero-Config Workflow

Truthound 3.0 auto-creates a `.truthound/` workspace at your project root. By default it manages:

- `.truthound/config.yaml`: resolved project defaults
- `.truthound/catalog/`: asset fingerprints and source signatures
- `.truthound/baselines/`: learned schemas and metric history
- `.truthound/runs/`: persisted `ValidationRunResult` metadata
- `.truthound/docs/`: generated validation docs
- `.truthound/plugins/`: resolved plugin manifest and trust metadata
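
The layout above can be sketched with the standard library alone. This is an illustration under assumptions, not Truthound's implementation: `ensure_workspace` is a hypothetical helper, the placeholder `config.yaml` content is invented, and how Truthound actually locates the project root is not covered here. It only shows the "create once, reuse afterwards" behavior.

```python
import tempfile
from pathlib import Path

# Sub-directories named in the README; how Truthound creates them is assumed.
WORKSPACE_DIRS = ("catalog", "baselines", "runs", "docs", "plugins")


def ensure_workspace(project_root: Path) -> Path:
    """Create the workspace layout if missing and return its path (idempotent)."""
    workspace = project_root / ".truthound"
    for name in WORKSPACE_DIRS:
        (workspace / name).mkdir(parents=True, exist_ok=True)
    config = workspace / "config.yaml"
    if not config.exists():
        # Placeholder content; the real file holds resolved project defaults.
        config.write_text("# resolved project defaults (placeholder)\n", encoding="utf-8")
    return workspace


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    first = ensure_workspace(root)
    second = ensure_workspace(root)  # the second call reuses the existing workspace
    created = sorted(p.name for p in first.iterdir())
    print(created)
```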

If you do nothing except call `th.check(data)`, Truthound will:

1. detect the asset/backend
2. resolve the active `TruthoundContext`
3. load or create a baseline
4. synthesize an auto-suite
5. plan and execute the validation
6. persist the run and validation docs when persistence is enabled

## Plugin Platform

Truthound now runs all plugins through a single lifecycle runtime:

- `PluginManager` is the canonical plugin manager
- `EnterprisePluginManager` is an async, capability-driven facade over the same runtime
- Plugins register through stable ports such as `register_check_factory`, `register_data_asset_provider`, `register_reporter`, `register_hook`, and `register_capability`
- Reporter plugins should target the contract-v3 surface where `ValidationRunResult` is the canonical render input and `RunPresentation` is the shared render projection
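
As a rough picture of registration through named ports, the sketch below uses a toy registry. `ToyRegistry` is hypothetical: its method names echo two of the ports listed above, but the signatures, the duplicate-name rule, and the dict-based result are assumptions, and the real `PluginManager` API may differ. See the plugin platform docs for the actual contracts.

```python
class ToyRegistry:
    """Hypothetical registry; method names echo the README, signatures are assumed."""

    def __init__(self) -> None:
        self.check_factories = {}
        self.reporters = {}

    def register_check_factory(self, name, factory) -> None:
        # Reject silent overrides so two plugins cannot claim the same check name.
        if name in self.check_factories:
            raise ValueError(f"check factory {name!r} already registered")
        self.check_factories[name] = factory

    def register_reporter(self, name, reporter) -> None:
        self.reporters[name] = reporter


def max_length_factory(limit: int):
    """Build a check that passes when every non-null value fits within `limit`."""
    return lambda values: all(v is None or len(v) <= limit for v in values)


def text_reporter(result: dict) -> str:
    """Render a toy result mapping as a single line of key=value pairs."""
    return ", ".join(f"{key}={value}" for key, value in sorted(result.items()))


registry = ToyRegistry()
registry.register_check_factory("max_length", max_length_factory)
registry.register_reporter("text", text_reporter)

check = registry.check_factories["max_length"](limit=5)
result = {"max_length": check(["abc", None, "abcdefgh"])}
print(registry.reporters["text"](result))
```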

## Documentation

- Documentation site: [truthound.netlify.app](https://truthound.netlify.app/)
- Getting started: [docs/getting-started/index.md](docs/getting-started/index.md)
- Quickstart: [docs/getting-started/quickstart.md](docs/getting-started/quickstart.md)
- Architecture: [docs/concepts/architecture.md](docs/concepts/architecture.md)
- Zero-config context: [docs/concepts/zero-config.md](docs/concepts/zero-config.md)
- Plugin platform: [docs/concepts/plugins.md](docs/concepts/plugins.md)
- Reporter SDK: [docs/guides/reporter-sdk.md](docs/guides/reporter-sdk.md)
- Checkpoints: [docs/guides/checkpoints.md](docs/guides/checkpoints.md)
- Performance and benchmarks: [docs/guides/performance.md](docs/guides/performance.md)
- Benchmark methodology: [docs/guides/benchmark-methodology.md](docs/guides/benchmark-methodology.md)
- Workload catalog: [docs/guides/benchmark-workloads.md](docs/guides/benchmark-workloads.md)
- GX parity gate: [docs/guides/gx-parity.md](docs/guides/gx-parity.md)
- Migration guide: [docs/guides/migration-3.0.md](docs/guides/migration-3.0.md)
- Legacy archive: [docs/legacy/index.md](docs/legacy/index.md)
- Release notes: [docs/releases/truthound-3.0-rc1.md](docs/releases/truthound-3.0-rc1.md)
- Latest benchmark summary: [docs/releases/latest-benchmark-summary.md](docs/releases/latest-benchmark-summary.md)
- ADRs: [docs/adr/001-validation-kernel.md](docs/adr/001-validation-kernel.md), [docs/adr/002-plugin-platform.md](docs/adr/002-plugin-platform.md), [docs/adr/003-result-model.md](docs/adr/003-result-model.md), [docs/adr/004-migration-compatibility.md](docs/adr/004-migration-compatibility.md)

## Development

```bash
# Default local test run
uv run --frozen --extra dev python -m pytest -q
# Collection-only pass over the test tree
uv run --frozen --extra dev python -m pytest --collect-only -q tests
# Contract, fault, and e2e lanes
uv run --frozen --extra dev python -m pytest -q -m "contract or fault or e2e" -p no:cacheprovider
# Full lane set, including opt-in integration, expensive, and soak runs
uv run --frozen --extra dev python -m pytest -q -m "contract or fault or integration or soak or stress or scale_100m or e2e" --run-integration --run-expensive --run-soak -p no:cacheprovider
# 3.0 contract, API, public-surface, and checkpoint tests
uv run --frozen --extra dev python -m pytest -q tests/test_truthound_3_0_contract.py tests/test_api.py tests/test_public_surface.py tests/test_checkpoint.py -p no:cacheprovider
# Benchmark parity suites
uv run --frozen --extra benchmarks python -m truthound.cli benchmark parity --suite pr-fast --frameworks truthound --backend local --strict
uv run --frozen --extra benchmarks python -m truthound.cli benchmark parity --suite nightly-core --frameworks both --backend local --strict
uv run --frozen --extra benchmarks python -m truthound.cli benchmark parity --suite nightly-sql --frameworks both --backend sqlite --strict
uv run --frozen --extra benchmarks python -m truthound.cli benchmark parity --suite release-ga --frameworks both --strict
# Docs link check and strict MkDocs build
uv run --frozen --extra dev python docs/scripts/check_links.py --mkdocs mkdocs.yml README.md CLAUDE.md
uv run --frozen --extra dev --extra docs mkdocs build --strict
# 2.x to 3.0 migration check
truthound doctor . --migrate-2to3
```

Official 3.0 benchmark claims stay blocked until the fixed self-hosted `release-ga` run produces `release-ga.json`, `env-manifest.json`, and `latest-benchmark-summary.md`.

Tests now follow a failure-first lane model:

- `contract`: stable public API and compatibility boundaries
- `fault`: deterministic failure injection, timeout, corruption, and concurrency scenarios
- `integration`: opt-in backend and external-service coverage
- `soak` and `stress`: nightly-only load and chaos coverage
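
One common way to keep opt-in lanes out of the default run is a `conftest.py` that skips marked tests unless a flag is passed. The sketch below shows that general pytest pattern; it is not this repository's `conftest.py`. The flag names come from the commands above, but the marker-to-flag mapping in `LANE_FLAGS` and the `skip_reason` helper are assumptions.

```python
# conftest.py (sketch): gate opt-in lanes behind command-line flags.
from __future__ import annotations

# Assumed mapping from pytest marker to the flag that enables it.
LANE_FLAGS = {
    "integration": "--run-integration",
    "soak": "--run-soak",
    "stress": "--run-expensive",
    "scale_100m": "--run-expensive",
}


def skip_reason(markers: set[str], enabled_flags: set[str]) -> str | None:
    """Return why a test should be skipped, or None if it may run."""
    for marker in sorted(markers):
        flag = LANE_FLAGS.get(marker)
        if flag is not None and flag not in enabled_flags:
            return f"{marker} lane needs {flag}"
    return None


def pytest_addoption(parser):
    # Register each opt-in flag once, defaulting to off.
    for flag in sorted(set(LANE_FLAGS.values())):
        parser.addoption(flag, action="store_true", default=False)


def pytest_collection_modifyitems(config, items):
    import pytest  # imported lazily so skip_reason stays usable without pytest

    enabled = {flag for flag in set(LANE_FLAGS.values()) if config.getoption(flag)}
    for item in items:
        reason = skip_reason({mark.name for mark in item.iter_markers()}, enabled)
        if reason:
            item.add_marker(pytest.mark.skip(reason=reason))
```

With this shape, `contract`, `fault`, and `e2e` tests always run, while a test marked `integration` is skipped unless `--run-integration` is given.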

The default local run is intentionally fast. Manual verification artifacts live under `verification/phase6` and are intentionally kept out of pytest discovery.

Official performance claims should come only from the release-grade parity artifacts under `.truthound/benchmarks/release/`. Nightly outputs are for trend visibility, not marketing numbers.

When adding tests, prefer scenarios that protect public contracts or operational failure modes. Avoid adding default-value, getter/setter, enum-literal, `to_dict()` round-trip, or CSS-string existence tests unless they prove a compatibility boundary that has failed before.
