Pipeline Quality (Column/Dataset-Based Checks)
Pipeline quality runs column- or dataset-level checks after an ETL load (e.g. row count, null rate, uniqueness). It is separate from contract quality, which validates each row against a data contract. See Data quality: contract vs pipeline for when to use each.
Enabling pipeline quality in ETL
Add quality_checks to your load config. When the pipeline runs, these checks execute after load; results appear in the run response and (when runs are persisted) in the runs dashboard.
YAML example

    # load.yaml
    type: file
    path: output/data.csv
    quality_checks:
      - type: row_count
        min: 1
        severity: fail
      - type: null_rate
        fields: [id, name]
        max_null_percent: 0
        severity: fail
      - type: uniqueness
        fields: [id]
        severity: warn
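
A typo in `type` or `severity` is easy to miss in a config like this one. The sketch below is a hypothetical pre-flight lint, not part of pycharter: `lint_quality_checks`, `KNOWN_TYPES` and `KNOWN_SEVERITIES` are illustrative names, and the allowed values are taken from the check types and severities documented on this page. Whether `severity` has a default is not stated here, so a missing `severity` is not flagged.

```python
# Hypothetical pre-flight lint for a quality_checks list; not part of pycharter.
KNOWN_TYPES = {"row_count", "null_rate", "uniqueness", "expression"}
KNOWN_SEVERITIES = {"fail", "warn"}


def lint_quality_checks(checks):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    for index, check in enumerate(checks):
        check_type = check.get("type")
        if check_type not in KNOWN_TYPES:
            problems.append(f"check {index}: unknown type {check_type!r}")
        # Only validate severity when it is present (a default may exist).
        if "severity" in check and check["severity"] not in KNOWN_SEVERITIES:
            problems.append(f"check {index}: unknown severity {check['severity']!r}")
    return problems


# The same three checks as the YAML example, written as Python dicts.
checks = [
    {"type": "row_count", "min": 1, "severity": "fail"},
    {"type": "null_rate", "fields": ["id", "name"], "max_null_percent": 0, "severity": "fail"},
    {"type": "uniqueness", "fields": ["id"], "severity": "warn"},
]
print(lint_quality_checks(checks))  # → []
```

Running a lint like this before submitting the config surfaces misspellings locally instead of at pipeline run time.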
Check types

| Type | Description | Config |
|---|---|---|
| `row_count` | Min/max row count | `min`, `max`, `severity` |
| `null_rate` | Max null % per field | `fields`, `max_null_percent`, `severity` |
| `uniqueness` | Fields must have unique values | `fields`, `severity` |
| `expression` | Custom Python expression | `expression`, `description`, `severity`; scope: `row_count`, `loaded_count`, `failed_count`, `records`, `fields` |
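
As a rough illustration of the `expression` check type, the sketch below evaluates an expression against a scope that holds some of the documented names. How pycharter evaluates expressions internally is not described here, so `evaluate_expression_check` and the scope values are assumptions made for the example. Disabling builtins in `eval()` narrows what the expression can reference but is not a security sandbox.

```python
# Illustrative only: this is not pycharter's evaluation code.
def evaluate_expression_check(check, scope):
    """Evaluate check['expression'] against the given scope and return a bool."""
    # An empty __builtins__ limits the expression to the names in scope.
    return bool(eval(check["expression"], {"__builtins__": {}}, dict(scope)))


check = {
    "type": "expression",
    "expression": "failed_count == 0 and row_count > 0",
    "description": "At least one row, and no failed rows",
    "severity": "warn",
}

# Hypothetical scope values for a run with 100 rows and no failures.
scope = {"row_count": 100, "loaded_count": 100, "failed_count": 0}
print(evaluate_expression_check(check, scope))  # → True
```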
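`severity` is either `fail` or `warn`. If any check with severity `fail` does not pass, the overall pipeline quality report is marked as not passed. A `warn` check that does not pass is counted as a warning and does not change that outcome.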
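
To make the interaction between check types and severity concrete, here is a minimal re-implementation of the documented semantics in plain Python. It is a sketch, not pycharter's code: `run_checks` is a hypothetical name, `uniqueness` is assumed to apply to the combination of the listed fields, and `failed_count` is assumed to count only failing checks with severity `fail`.

```python
# Illustrative re-implementation of the documented semantics; not pycharter's code.
def run_checks(records, checks):
    """Evaluate row_count, null_rate and uniqueness checks over a list of dicts."""
    results = []
    for check in checks:
        kind = check["type"]
        if kind == "row_count":
            n = len(records)
            ok = check.get("min", 0) <= n <= check.get("max", float("inf"))
        elif kind == "null_rate":
            ok = True
            for field in check["fields"]:
                nulls = sum(1 for r in records if r.get(field) is None)
                percent = 100.0 * nulls / len(records) if records else 0.0
                ok = ok and percent <= check["max_null_percent"]
        elif kind == "uniqueness":
            # Assumption: uniqueness is tested on the combination of the listed fields.
            keys = [tuple(r.get(f) for f in check["fields"]) for r in records]
            ok = len(keys) == len(set(keys))
        else:
            raise ValueError(f"unsupported check type: {kind}")
        results.append({"check_type": kind, "severity": check["severity"], "ok": ok})

    failed = [r for r in results if not r["ok"] and r["severity"] == "fail"]
    warnings = [r for r in results if not r["ok"] and r["severity"] == "warn"]
    return {
        "passed": not failed,  # only severity=fail failures set this to False
        "total_checks": len(results),
        "failed_count": len(failed),
        "warning_count": len(warnings),
    }


records = [{"id": 1, "name": "a"}, {"id": 1, "name": None}]
checks = [
    {"type": "row_count", "min": 1, "severity": "fail"},
    {"type": "null_rate", "fields": ["id"], "max_null_percent": 0, "severity": "fail"},
    {"type": "uniqueness", "fields": ["id"], "severity": "warn"},
]
print(run_checks(records, checks))
# → {'passed': True, 'total_checks': 3, 'failed_count': 0, 'warning_count': 1}
```

The duplicate `id` trips the `uniqueness` check, but because that check has severity `warn` the report still passes.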
REST API
ETL run response
When you run an ETL pipeline via POST /api/v1/etl/run with quality_checks in the load config, the response includes:
- `pipeline_quality_report` (object or null): Column/dataset quality result.
  - `passed`: `true` if no severity=FAIL check failed.
  - `total_checks`, `failed_count`, `warning_count`.
  - `checks`: list of per-check results (`check_name`, `check_type`, `status`, `severity`, `message`, `details`).
If the load config has no quality_checks, pipeline_quality_report is null.
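
A client reading this response needs to handle both the object and the `null` case. The sketch below works on an already-decoded JSON body; `summarize_quality` is a hypothetical helper, the sample payload is made up for illustration, and placing `pipeline_quality_report` at the top level of the body is an assumption.

```python
def summarize_quality(response):
    """Return a one-line summary of pipeline_quality_report from a run response dict."""
    report = response.get("pipeline_quality_report")
    if report is None:
        return "pipeline quality: not run (no quality_checks in load config)"
    verdict = "passed" if report["passed"] else "failed"
    return (
        f"pipeline quality: {verdict} "
        f"({report['total_checks']} checks, "
        f"{report['failed_count']} failed, {report['warning_count']} warnings)"
    )


# Hypothetical response body, trimmed to the fields described above.
response = {
    "pipeline_quality_report": {
        "passed": True,
        "total_checks": 3,
        "failed_count": 0,
        "warning_count": 1,
        "checks": [],
    }
}
print(summarize_quality(response))
# → pipeline quality: passed (3 checks, 0 failed, 1 warnings)
```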
Runs API
Pipeline run history (GET /api/v1/runs, etc.) stores and returns:
- `pipeline_quality_passed`: `true`/`false` when pipeline quality was run; `null` otherwise.
- `contract_quality_score` / `contract_quality_passed`: Used for row-based (contract) quality; see Data quality.
So you can tell whether pipeline quality ran and whether it passed, without conflating it with contract quality.
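
Because `pipeline_quality_passed` has three states, a plain truthiness test would treat "not run" the same as "failed". A minimal sketch of the distinction, assuming run records have been decoded into Python dicts (`pipeline_quality_state` is a hypothetical helper):

```python
def pipeline_quality_state(run):
    """Map a run record's pipeline_quality_passed field to a label."""
    value = run.get("pipeline_quality_passed")
    # Compare against None explicitly: False means "ran and failed",
    # None means "pipeline quality did not run".
    if value is None:
        return "not_run"
    return "passed" if value else "failed"


# Hypothetical run records, reduced to the one field of interest.
runs = [
    {"pipeline_quality_passed": True},
    {"pipeline_quality_passed": False},
    {"pipeline_quality_passed": None},
]
print([pipeline_quality_state(r) for r in runs])
# → ['passed', 'failed', 'not_run']
```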
Python API
Use the ETL pipeline with a load step that includes quality_checks:

    import asyncio

    from pycharter import Pipeline

    config = {
        "extract": {"type": "file", "path": "input.csv"},
        "load": {
            "type": "file",
            "path": "output.csv",
            "quality_checks": [
                {"type": "row_count", "min": 1, "severity": "fail"},
                {"type": "null_rate", "fields": ["id"], "max_null_percent": 0, "severity": "fail"},
            ],
        },
    }


    async def main():
        pipeline = Pipeline.from_dict(config)
        # run() is awaited, so it has to be called from inside an async function.
        result = await pipeline.run()
        # quality_report is only set when the load config defines quality_checks.
        if result.quality_report:
            print("Pipeline quality passed:", result.quality_report.passed)
            for c in result.quality_report.checks:
                print(f"  {c.check_type}: {c.status.value} - {c.message}")


    asyncio.run(main())
The same quality_report is exposed in the REST response as pipeline_quality_report.
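
In a script or scheduled job it can be useful to turn a failed report into an exception so that the process exits non-zero. The sketch below shows one way to do that using only the `quality_report.passed`, `checks`, `check_type` and `message` attributes mentioned above. `require_quality` and `PipelineQualityError` are hypothetical names, not part of pycharter, and `SimpleNamespace` objects stand in for a real run result so the example runs without the library.

```python
from types import SimpleNamespace


class PipelineQualityError(RuntimeError):
    """Hypothetical exception type; pycharter is not stated to provide one."""


def require_quality(result):
    """Raise if the run has a quality report and that report did not pass."""
    report = result.quality_report
    if report is None or report.passed:
        return
    # List every check's message; filtering by status is left out because
    # the possible status values are not documented here.
    lines = [f"{c.check_type}: {c.message}" for c in report.checks]
    raise PipelineQualityError("pipeline quality failed:\n" + "\n".join(lines))


# Stand-in for a real run result with one failing check.
failing_result = SimpleNamespace(
    quality_report=SimpleNamespace(
        passed=False,
        checks=[SimpleNamespace(check_type="row_count", message="too few rows")],
    )
)

try:
    require_quality(failing_result)
except PipelineQualityError as exc:
    print(exc)
```

With a real pipeline, the call would go right after `result = await pipeline.run()`.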
PostLoadChecker alias
The class that runs pipeline quality checks is QualityChecker in pycharter.etl_generator. To avoid confusion with the contract-based QualityCheck, a PostLoadChecker alias is available:

    from pycharter.etl_generator import PostLoadChecker  # same as QualityChecker

    checker = PostLoadChecker(
        checks=[
            {"type": "row_count", "min": 1, "severity": "fail"},
            {"type": "null_rate", "fields": ["id"], "max_null_percent": 0, "severity": "fail"},
        ],
        pipeline_name="orders",
    )
    # records and load_result are not defined in this snippet; they are
    # assumed to come from a load step that has already run.
    report = checker.run(records, load_result)
    print(f"Passed: {report.passed}")
| Class | Module | Purpose |
|---|---|---|
| `QualityCheck` | `pycharter.quality` | Contract-based quality scoring (per-row) |
| `QualityChecker` / `PostLoadChecker` | `pycharter.etl_generator` | Column/dataset checks after load |
See also
- Data quality: contract vs pipeline — when to use contract vs pipeline quality.
- Data Quality Monitoring — contract (row-based) quality.
- REST API — quality endpoints and run fields.