Metadata-Version: 2.4
Name: truthound
Version: 2.0.0
Summary: Zero-Configuration Data Quality Framework Powered by Polars
Project-URL: Homepage, https://github.com/seadonggyun4/Truthound
Project-URL: Repository, https://github.com/seadonggyun4/Truthound
Project-URL: Issues, https://github.com/seadonggyun4/Truthound/issues
Author-email: seadonggyun4 <seadonggyun4@gmail.com>
License-Expression: Apache-2.0
License-File: LICENSE
Keywords: data-masking,data-quality,data-validation,pii-detection,polars
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Typing :: Typed
Requires-Python: >=3.11
Requires-Dist: polars>=1.0.0
Requires-Dist: pyyaml>=6.0.0
Requires-Dist: rich>=13.0.0
Requires-Dist: typer>=0.12.0
Provides-Extra: all
Requires-Dist: aiokafka>=0.9.0; extra == 'all'
Requires-Dist: azure-storage-blob>=12.0.0; extra == 'all'
Requires-Dist: boto3>=1.26.0; extra == 'all'
Requires-Dist: duckdb>=1.0.0; extra == 'all'
Requires-Dist: elasticsearch[async]>=8.0.0; extra == 'all'
Requires-Dist: google-cloud-storage>=2.0.0; extra == 'all'
Requires-Dist: jinja2>=3.0.0; extra == 'all'
Requires-Dist: motor>=3.0.0; extra == 'all'
Requires-Dist: pandas>=2.0.0; extra == 'all'
Requires-Dist: reflex>=0.4.0; extra == 'all'
Requires-Dist: scikit-learn>=1.3.0; extra == 'all'
Requires-Dist: scipy>=1.10.0; extra == 'all'
Requires-Dist: sqlalchemy>=2.0.0; extra == 'all'
Requires-Dist: weasyprint>=60.0; extra == 'all'
Requires-Dist: xxhash>=3.4.0; extra == 'all'
Provides-Extra: anomaly
Requires-Dist: scikit-learn>=1.3.0; extra == 'anomaly'
Requires-Dist: scipy>=1.10.0; extra == 'anomaly'
Provides-Extra: async-datasources
Requires-Dist: aiokafka>=0.9.0; extra == 'async-datasources'
Requires-Dist: elasticsearch[async]>=8.0.0; extra == 'async-datasources'
Requires-Dist: motor>=3.0.0; extra == 'async-datasources'
Provides-Extra: azure
Requires-Dist: azure-storage-blob>=12.0.0; extra == 'azure'
Provides-Extra: dashboard
Requires-Dist: reflex>=0.4.0; extra == 'dashboard'
Provides-Extra: database
Requires-Dist: sqlalchemy>=2.0.0; extra == 'database'
Provides-Extra: dev
Requires-Dist: mypy>=1.10.0; extra == 'dev'
Requires-Dist: pandas>=2.0.0; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.23.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.0.0; extra == 'dev'
Requires-Dist: pytest>=8.0.0; extra == 'dev'
Requires-Dist: ruff>=0.4.0; extra == 'dev'
Requires-Dist: scikit-learn>=1.3.0; extra == 'dev'
Requires-Dist: scipy>=1.10.0; extra == 'dev'
Provides-Extra: docs
Requires-Dist: mkdocs-material>=9.5.0; extra == 'docs'
Requires-Dist: mkdocs>=1.6.0; extra == 'docs'
Requires-Dist: mkdocstrings[python]>=0.24.0; extra == 'docs'
Requires-Dist: pymdown-extensions>=10.8.0; extra == 'docs'
Provides-Extra: drift
Requires-Dist: scipy>=1.10.0; extra == 'drift'
Provides-Extra: duckdb
Requires-Dist: duckdb>=1.0.0; extra == 'duckdb'
Provides-Extra: elasticsearch
Requires-Dist: elasticsearch[async]>=8.0.0; extra == 'elasticsearch'
Provides-Extra: gcs
Requires-Dist: google-cloud-storage>=2.0.0; extra == 'gcs'
Provides-Extra: kafka
Requires-Dist: aiokafka>=0.9.0; extra == 'kafka'
Provides-Extra: mongodb
Requires-Dist: motor>=3.0.0; extra == 'mongodb'
Provides-Extra: nosql
Requires-Dist: elasticsearch[async]>=8.0.0; extra == 'nosql'
Requires-Dist: motor>=3.0.0; extra == 'nosql'
Provides-Extra: pdf
Requires-Dist: weasyprint>=60.0; extra == 'pdf'
Provides-Extra: perf
Requires-Dist: xxhash>=3.4.0; extra == 'perf'
Provides-Extra: reports
Requires-Dist: jinja2>=3.0.0; extra == 'reports'
Provides-Extra: s3
Requires-Dist: boto3>=1.26.0; extra == 's3'
Provides-Extra: stores
Requires-Dist: azure-storage-blob>=12.0.0; extra == 'stores'
Requires-Dist: boto3>=1.26.0; extra == 'stores'
Requires-Dist: google-cloud-storage>=2.0.0; extra == 'stores'
Requires-Dist: sqlalchemy>=2.0.0; extra == 'stores'
Provides-Extra: streaming
Requires-Dist: aiokafka>=0.9.0; extra == 'streaming'
Description-Content-Type: text/markdown

# Truthound

Truthound is a Polars-first data validation framework. The 2.0 redesign keeps the familiar `th.check()`, `th.scan()`, `th.mask()`, `th.profile()`, and `th.learn()` entry points, but routes validation through a smaller core kernel with explicit suite, planning, runtime, result, and plugin boundaries.

## Why 2.0

Truthound now centers on five internal layers:

| Layer | Responsibility |
| --- | --- |
| `contracts` | Stable ports such as `DataAsset`, `ExecutionBackend`, `MetricRepository`, and plugin capabilities |
| `suite` | Immutable validation intent via `ValidationSuite`, `CheckSpec`, `SchemaSpec`, evidence policy, and severity policy |
| `planning` | Scan planning, backend routing, duplicate check accounting, and pushdown eligibility |
| `runtime` | Session lifecycle, retries, timeout-safe execution, exception isolation, and evidence capture |
| `results` | `CheckResult`, `ValidationRunResult`, and `ExecutionIssue` as the canonical output model |

This structure is intentionally informed by several mature validation systems:

- Great Expectations: separation of suite definition, execution, and artifacts
- Soda: scan planning and backend-aware execution
- Deequ: analyzer, constraint, verification, and repository decomposition
- Pandera: schema-first modeling and lazy validation ergonomics

## Quick Start

```bash
pip install truthound
```

```python
import truthound as th
from truthound.datadocs import generate_validation_report
from truthound.reporters import get_reporter

report = th.check(
    {"id": [1, 2, 2], "email": ["a@example.com", None, "c@example.com"]},
    validators=["null", "unique"],
)

print(report)
print(report.validation_run.execution_mode)
print([check.name for check in report.validation_run.checks])

json_report = get_reporter("json").render(report.validation_run)
validation_docs = generate_validation_report(report.validation_run)
```

```bash
truthound check data.csv --validators null,unique
truthound check --connection "sqlite:///warehouse.db" --table users --pushdown
truthound plugins list --json
```

## Public Surface

The root package exports a deliberately small API:

- Stable facade: `check`, `scan`, `mask`, `profile`, `learn`, `read`
- Core types: `ValidationSuite`, `CheckSpec`, `SchemaSpec`, `ValidationRunResult`, `CheckResult`
- Checkpoint runtime results: `CheckpointResult.validation_run` is canonical; `validation_result` remains as a deprecated compatibility alias
- Reporter-facing types: `truthound.reporters.RunPresentation`, `truthound.reporters.ReporterContext`
- Validation docs entry points: `truthound.datadocs.ValidationDocsBuilder`, `truthound.datadocs.generate_validation_report`
- Advanced systems: import by namespace, for example `truthound.ml`, `truthound.lineage`, or `truthound.datadocs`
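
A deprecated compatibility alias such as `validation_result` is commonly implemented as a read-only property that warns and forwards to the canonical attribute. The sketch below shows that general pattern only. `SketchCheckpointResult` is a hypothetical stand-in, and whether Truthound's `CheckpointResult` emits a `DeprecationWarning` in this way is an assumption.

```python
import warnings
from dataclasses import dataclass


@dataclass
class SketchCheckpointResult:
    """Hypothetical stand-in; NOT truthound's CheckpointResult."""
    validation_run: dict  # canonical attribute

    @property
    def validation_result(self) -> dict:
        """Deprecated alias that forwards to `validation_run`."""
        warnings.warn(
            "validation_result is deprecated; use validation_run",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller's line
        )
        return self.validation_run


result = SketchCheckpointResult(validation_run={"checks": ["null", "unique"]})

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy_view = result.validation_result  # old spelling still works

print(legacy_view is result.validation_run)
print([w.category.__name__ for w in caught])
```

New code would read `validation_run` directly and never touch the alias.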

The experimental `use_engine` and `--use-engine` switches were removed as part of the 2.0 cleanup.

## Plugin Platform

Truthound now runs all plugins through a single lifecycle runtime:

- `PluginManager` is the canonical plugin manager
- `EnterprisePluginManager` is an async, capability-driven facade over the same runtime
- Plugins register through stable ports such as `register_check_factory`, `register_data_asset_provider`, `register_reporter`, `register_hook`, and `register_capability`
- Reporter plugins should target the contract-v2 surface where `ValidationRunResult` is the canonical input
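
As a rough illustration of port-based registration, the sketch below uses a stand-in registry. The method names `register_check_factory` and `register_reporter` come from the list above, but their signatures here are assumptions, and `SketchPluginManager`, `positive_check_factory`, and `install_plugin` are hypothetical. See the plugin platform docs for the real contracts.

```python
import json
from typing import Callable


class SketchPluginManager:
    """Stand-in registry; NOT truthound's PluginManager. Signatures are assumed."""

    def __init__(self) -> None:
        self.check_factories: dict[str, Callable] = {}
        self.reporters: dict[str, Callable] = {}

    def register_check_factory(self, name: str, factory: Callable) -> None:
        if name in self.check_factories:
            raise ValueError(f"check factory {name!r} is already registered")
        self.check_factories[name] = factory

    def register_reporter(self, name: str, render: Callable) -> None:
        self.reporters[name] = render


def positive_check_factory(column: str) -> Callable[[dict], bool]:
    """Build a check that passes when every value in `column` is > 0."""
    def check(data: dict) -> bool:
        return all(v is not None and v > 0 for v in data[column])
    return check


def install_plugin(manager: SketchPluginManager) -> None:
    """A plugin only talks to the registration ports, never to internals."""
    manager.register_check_factory("positive", positive_check_factory)
    manager.register_reporter("json", lambda run: json.dumps(run, sort_keys=True))


manager = SketchPluginManager()
install_plugin(manager)

check = manager.check_factories["positive"]("amount")
run = {"positive:amount": check({"amount": [3, 5, -1]})}
rendered = manager.reporters["json"](run)
print(rendered)
```

The point of the pattern is that the plugin depends on a narrow registration surface, so the runtime behind it can change without breaking plugin code.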

## Documentation

- Architecture: [docs/concepts/architecture.md](docs/concepts/architecture.md)
- Plugin platform: [docs/concepts/plugins.md](docs/concepts/plugins.md)
- Reporter SDK: [docs/guides/reporter-sdk.md](docs/guides/reporter-sdk.md)
- Checkpoints: [docs/guides/checkpoints.md](docs/guides/checkpoints.md)
- Migration guide: [docs/guides/migration-2.0.md](docs/guides/migration-2.0.md)
- Legacy archive: [docs/legacy/index.md](docs/legacy/index.md)
- Release notes: [docs/releases/truthound-2.0.md](docs/releases/truthound-2.0.md)
- ADRs: [docs/adr/001-validation-kernel.md](docs/adr/001-validation-kernel.md), [docs/adr/002-plugin-platform.md](docs/adr/002-plugin-platform.md), [docs/adr/003-result-model.md](docs/adr/003-result-model.md), [docs/adr/004-migration-compatibility.md](docs/adr/004-migration-compatibility.md)

## Development

```bash
uv run --frozen --extra dev python -m pytest -q
uv run --frozen --extra dev python -m pytest --collect-only -q tests
uv run --frozen --extra dev python -m pytest -q -m "contract or fault or e2e" -p no:cacheprovider
uv run --frozen --extra dev python -m pytest -q -m "contract or fault or integration or soak or stress or scale_100m or e2e" --run-integration --run-expensive --run-soak -p no:cacheprovider
uv run --frozen --extra dev python docs/scripts/check_links.py --mkdocs mkdocs.yml README.md CLAUDE.md
uv run --frozen --extra dev --extra docs mkdocs build --strict
```

Tests now follow a failure-first lane model:

- `contract`: stable public API and compatibility boundaries
- `fault`: deterministic failure injection, timeout, corruption, and concurrency scenarios
- `integration`: opt-in backend and external-service coverage
- `soak` and `stress`: nightly-only load and chaos coverage

The default local run is intentionally fast. Manual verification artifacts live under `verification/phase6` and are kept out of pytest discovery.

When adding tests, prefer scenarios that protect public contracts or operational failure modes. Avoid adding default-value, getter/setter, enum-literal, `to_dict()` round-trip, or CSS-string existence tests unless they prove a compatibility boundary that has failed before.
