
Pipeline Quality (Column/Dataset-Based Checks)

Pipeline quality runs column- or dataset-level checks after an ETL load (e.g. row count, null rate, uniqueness). It is separate from contract quality, which validates each row against a data contract. See Data quality: contract vs pipeline for when to use each.

Enabling pipeline quality in ETL

Add quality_checks to your load config. When the pipeline runs, these checks execute after load; results appear in the run response and (when runs are persisted) in the runs dashboard.

YAML example

# load.yaml
type: file
path: output/data.csv
quality_checks:
  - type: row_count
    min: 1
    severity: fail
  - type: null_rate
    fields: [id, name]
    max_null_percent: 0
    severity: fail
  - type: uniqueness
    fields: [id]
    severity: warn

Check types

| Type | Description | Config |
| --- | --- | --- |
| row_count | Min/max row count | min, max, severity |
| null_rate | Max null % per field | fields, max_null_percent, severity |
| uniqueness | Fields must have unique values | fields, severity |
| expression | Custom Python expression | expression, description, severity; scope: row_count, loaded_count, failed_count, records, fields |

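severity is either fail or warn. If any check with severity fail does not pass, the overall pipeline quality report is marked as not passed. A check with severity warn that does not pass is still reported, but it does not change the overall result.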
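To make the table concrete, here is a minimal sketch of what the three built-in checks measure and how severity feeds the overall result. It is plain Python for illustration and is not pycharter's implementation. The helper names are hypothetical, and two behaviours are assumptions: a value counts as null when it is missing or None, and uniqueness over several fields is treated as uniqueness of the combined values.

```python
from collections import Counter


def row_count_ok(records, min_rows=None, max_rows=None):
    """True when the number of records is within the optional bounds."""
    n = len(records)
    if min_rows is not None and n < min_rows:
        return False
    if max_rows is not None and n > max_rows:
        return False
    return True


def null_percent(records, field):
    """Percentage of records where `field` is missing or None (assumed definition)."""
    if not records:
        return 0.0
    nulls = sum(1 for r in records if r.get(field) is None)
    return 100.0 * nulls / len(records)


def duplicate_keys(records, fields):
    """Value tuples over `fields` that occur more than once (assumed definition)."""
    counts = Counter(tuple(r.get(f) for f in fields) for r in records)
    return [key for key, n in counts.items() if n > 1]


def overall_passed(results):
    """results is a list of (severity, ok) pairs; only failed 'fail' checks block."""
    return not any(sev == "fail" and not ok for sev, ok in results)


# Hypothetical sample data.
records = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": None},
    {"id": 2, "name": "c"},
]

results = [
    ("fail", row_count_ok(records, min_rows=1)),
    ("fail", null_percent(records, "id") <= 0),
    ("warn", not duplicate_keys(records, ["id"])),
]
print("overall passed:", overall_passed(results))
```

In this sample the duplicate id only trips the warn-level uniqueness check, so the overall result still passes. Raising that check to severity fail would flip it.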

REST API

ETL run response

When you run an ETL pipeline via POST /api/v1/etl/run with quality_checks in the load config, the response includes:

  • pipeline_quality_report (object or null): Column/dataset quality result, with these fields:
      ◦ passed: true if no check with severity fail failed.
      ◦ total_checks, failed_count, warning_count: summary counts.
      ◦ checks: list of per-check results (check_name, check_type, status, severity, message, details).

If the load config has no quality_checks, pipeline_quality_report is null.
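A client consuming this response might handle the null case first and then list the checks that need attention. The sketch below works on an already-parsed response body. The sample payload is hypothetical: the field names follow the list above, but the values, the check names, and the status strings ("passed", "warning") are assumptions, since the possible status values are not listed here.

```python
# Hypothetical parsed response body; values and status strings are made up.
SAMPLE_RESPONSE = {
    "pipeline_quality_report": {
        "passed": True,
        "total_checks": 2,
        "failed_count": 0,
        "warning_count": 1,
        "checks": [
            {"check_name": "row_count", "check_type": "row_count",
             "status": "passed", "severity": "fail",
             "message": "3 rows", "details": {}},
            {"check_name": "uniqueness_id", "check_type": "uniqueness",
             "status": "warning", "severity": "warn",
             "message": "duplicate id values", "details": {}},
        ],
    }
}


def summarize(response):
    """Return None when pipeline quality did not run, else a small summary dict."""
    report = response.get("pipeline_quality_report")
    if report is None:
        return None
    # Treat every status other than the assumed "passed" string as a problem.
    problems = [
        f"{c['severity']}: {c['check_name']}: {c['message']}"
        for c in report["checks"]
        if c["status"] != "passed"
    ]
    return {"passed": report["passed"], "problems": problems}


print(summarize(SAMPLE_RESPONSE))
```

Returning None for a missing report keeps "quality did not run" distinct from "quality ran and failed", which matters for any caller that gates on the result.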

Runs API

Pipeline run history (GET /api/v1/runs, etc.) stores and returns:

  • pipeline_quality_passed: true/false when pipeline quality was run; null otherwise.
  • contract_quality_score / contract_quality_passed: Used for row-based (contract) quality; see Data quality.

This lets you tell whether pipeline quality ran, and whether it passed, without conflating it with contract quality.
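Because pipeline_quality_passed has three states, a plain truthiness test would treat "not run" the same as "failed". A minimal sketch of handling all three, assuming each run record is a parsed dict with that key (the sample records and the function name are hypothetical):

```python
def pipeline_quality_state(run):
    """Map the tri-state pipeline_quality_passed field to a label."""
    value = run.get("pipeline_quality_passed")
    if value is None:
        # null in the API: pipeline quality was not run for this pipeline run
        return "not_run"
    return "passed" if value else "failed"


# Hypothetical run records, reduced to the one field of interest.
runs = [
    {"pipeline_quality_passed": True},
    {"pipeline_quality_passed": False},
    {"pipeline_quality_passed": None},
]
print([pipeline_quality_state(r) for r in runs])
```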

Python API

Use the ETL pipeline with a load step that includes quality_checks:

import asyncio

from pycharter import Pipeline

config = {
    "extract": {"type": "file", "path": "input.csv"},
    "load": {
        "type": "file",
        "path": "output.csv",
        "quality_checks": [
            {"type": "row_count", "min": 1, "severity": "fail"},
            {"type": "null_rate", "fields": ["id"], "max_null_percent": 0, "severity": "fail"},
        ],
    },
}


async def main():
    pipeline = Pipeline.from_dict(config)
    # run() is awaited, so it has to be called from inside a coroutine
    result = await pipeline.run()

    # quality_report is empty when the load config has no quality_checks
    if result.quality_report:
        print("Pipeline quality passed:", result.quality_report.passed)
        for c in result.quality_report.checks:
            print(f"  {c.check_type}: {c.status.value} - {c.message}")


asyncio.run(main())

The same quality_report is exposed in the REST response as pipeline_quality_report.

PostLoadChecker alias

The class that runs pipeline quality checks is QualityChecker in pycharter.etl_generator. To avoid confusion with the contract-based QualityCheck, a PostLoadChecker alias is available:

from pycharter.etl_generator import PostLoadChecker  # same as QualityChecker

checker = PostLoadChecker(
    checks=[
        {"type": "row_count", "min": 1, "severity": "fail"},
        {"type": "null_rate", "fields": ["id"], "max_null_percent": 0, "severity": "fail"},
    ],
    pipeline_name="orders",
)
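# records and load_result are not defined here; they are assumed to come
# from your own load step (the loaded records and that step's result)
report = checker.run(records, load_result)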
print(f"Passed: {report.passed}")

| Class | Module | Purpose |
| --- | --- | --- |
| QualityCheck | pycharter.quality | Contract-based quality scoring (per-row) |
| QualityChecker / PostLoadChecker | pycharter.etl_generator | Column/dataset checks after load |

See also