Quality

PyCharter provides two quality systems. This page covers contract quality: row-level validation with scoring, violation tracking, and profiling. For column- and dataset-level checks after ETL loads, see Pipeline Quality and PostLoadChecker.

Quick Start

One-Liner Quality Check

The fastest way to check quality:

from pycharter import check_quality

report = check_quality(
    contract={"schema": {
        "version": "1.0.0",
        "properties": {"name": {"type": "string"}, "email": {"type": "string", "format": "email"}},
        "required": ["name", "email"]
    }},
    data=[
        {"name": "Alice", "email": "alice@example.com"},
        {"name": "", "email": "invalid"},
    ],
)

print(f"Score: {report.quality_score.overall_score:.1f}/100")
print(f"Valid: {report.valid_count}/{report.record_count}")

Quick Data Profiling

Profile a dataset without a contract:

from pycharter import profile_data

profile = profile_data([{"name": "Alice", "age": 30}, {"name": "Bob", "age": None}])
print(f"Records: {profile['record_count']}")
print(f"Age nulls: {profile['field_profiles']['age']['null_count']}")

Convenience Functions

check_quality

from pycharter import check_quality

report = check_quality(
    contract=contract_dict_or_file_path,
    data=records_or_file_path,
    options=None,  # Defaults to QualityCheckOptions.basic()
)

| Parameter | Type | Description |
| --- | --- | --- |
| contract | dict \| str | Contract dict or file path |
| data | list[dict] \| str \| Callable | Records, file path, or callable |
| options | QualityCheckOptions \| None | Options (defaults to basic()) |

Returns: QualityReport
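
The `data` argument accepts three shapes: a list of records, a file path, or a zero-argument callable. The sketch below illustrates one way such inputs could be normalized into a list of dicts. `load_records` and `fetch_users` are hypothetical names, this is not PyCharter's internal loader, and the sketch assumes a string is a path to a JSON file containing an array of objects.

```python
import json
from pathlib import Path
from typing import Any, Callable, Union

Records = list[dict[str, Any]]
DataSource = Union[Records, str, Callable[[], Any]]


def load_records(data: DataSource) -> Records:
    """Hypothetical normalizer: turn any supported input into a list of dicts."""
    if callable(data):
        # A callable is invoked lazily, e.g. to run a query at check time.
        data = data()
    if isinstance(data, str):
        # Assumption: a string is a path to a JSON file holding an array.
        data = json.loads(Path(data).read_text(encoding="utf-8"))
    if not isinstance(data, list):
        raise TypeError(f"Expected a list of records, got {type(data).__name__}")
    return data


def fetch_users() -> Records:
    """Stand-in for a real query; returns records when called."""
    return [{"name": "Alice", "email": "alice@example.com"}]


records = load_records(fetch_users)
```

Passing a callable keeps the data fetch out of module import time, which can be convenient when the check is defined in one place and executed later by a scheduler.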

check_quality_with_store

from pycharter import check_quality_with_store

report = check_quality_with_store(
    store=store,
    contract_name="user",
    contract_version="1.0.0",
    data=records,
)

| Parameter | Type | Description |
| --- | --- | --- |
| store | MetadataStoreClient | Connected metadata store |
| contract_name | str | Contract name in the store |
| contract_version | str | Contract version |
| data | list[dict] \| str \| Callable | Records, file path, or callable |
| options | QualityCheckOptions \| None | Options (defaults to basic()) |

Returns: QualityReport

profile_data

from pycharter import profile_data

profile = profile_data(data, fields=["name", "age"])  # or fields=None for all

| Parameter | Type | Description |
| --- | --- | --- |
| data | list[dict] | Records to profile |
| fields | list[str] \| None | Subset of fields (all if None) |

Returns: dict with record_count, field_profiles, overall_stats

See the Data Profiling Guide for the full profile structure.
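
To make the shape of the returned dict concrete, here is a minimal standalone sketch that produces `record_count` and a per-field `null_count`. It is an illustration only, not PyCharter's profiler: `sketch_profile` and the `count` key are hypothetical, and the real profile contains more statistics than shown here.

```python
from typing import Any


def sketch_profile(records: list[dict[str, Any]]) -> dict[str, Any]:
    """Illustrative profiler: count records and per-field nulls."""
    field_profiles: dict[str, dict[str, int]] = {}
    for record in records:
        for field, value in record.items():
            stats = field_profiles.setdefault(field, {"count": 0, "null_count": 0})
            stats["count"] += 1
            if value is None:
                stats["null_count"] += 1
    return {"record_count": len(records), "field_profiles": field_profiles}


profile = sketch_profile([{"name": "Alice", "age": 30}, {"name": "Bob", "age": None}])
print(profile["record_count"])                          # 2
print(profile["field_profiles"]["age"]["null_count"])   # 1
```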


QualityCheckOptions Presets

Instead of configuring every option, use a preset:

from pycharter import QualityCheckOptions

opts = QualityCheckOptions.basic()       # Quick check
opts = QualityCheckOptions.strict()      # Gated check with thresholds
opts = QualityCheckOptions.monitoring()  # Recurring check with dedup
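
| Preset | Metrics | Violations | Profiling | Thresholds | Skip unchanged | Dedup |
| --- | --- | --- | --- | --- | --- | --- |
| basic() | Yes | Yes | No | No | No | Yes |
| strict() | Yes | Yes | Yes | Yes (defaults) | No | Yes |
| monitoring() | Yes | Yes | Yes | Yes (defaults) | Yes | Yes |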

You can also customize any preset:

opts = QualityCheckOptions.strict()
opts.sample_size = 1000  # Only check a sample
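
According to the `run` source reproduced later on this page, sampling only applies when `sample_size` is set and smaller than the dataset, and records are drawn with `random.sample`. The standalone sketch below mirrors that condition. `sample_records` is a hypothetical helper, and the `seed` parameter is an assumption added to make the illustration reproducible; this page does not say whether PyCharter exposes a seed.

```python
import random
from typing import Any, Optional


def sample_records(
    records: list[dict[str, Any]],
    sample_size: Optional[int],
    seed: Optional[int] = None,
) -> list[dict[str, Any]]:
    """Return a random subset when sample_size is set and smaller than the data."""
    if sample_size and sample_size < len(records):
        rng = random.Random(seed)  # local RNG so the global state is untouched
        return rng.sample(records, sample_size)
    return records


records = [{"id": i} for i in range(10_000)]
subset = sample_records(records, sample_size=1000, seed=42)
print(len(subset))  # 1000
```

Because the sample is random, scores computed on it can vary from run to run unless the draw is seeded.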

QualityCheck Class

For store-backed schemas, database persistence, or advanced control:

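from pycharter import QualityCheck, QualityCheckOptions, QualityThresholds

check = QualityCheck(store=store)
report = check.run(
    contract_name="user",
    contract_version="1.0.0",
    data=records,
    # Thresholds are passed through options; run() has no thresholds argument.
    options=QualityCheckOptions(
        check_thresholds=True,
        thresholds=QualityThresholds(min_overall_score=95.0),
    ),
)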

API Reference

QualityCheck

QualityCheck(
    store: MetadataStoreClient | None = None,
    db_session: "Session" | None = None,
)

Contract-based, orchestrator-agnostic quality scoring engine.

Validates data against a data contract, calculates quality scores, records violations, and optionally checks thresholds.

This class can be used:

- Standalone (CLI, API, Python scripts)
- Within orchestrators (Airflow, Prefect, Dagster)
- Via API calls

For post-load structural checks (row count, null rate, uniqueness), see PostLoadChecker in pycharter.etl_generator.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| store | MetadataStoreClient \| None | Optional metadata store for retrieving contracts and storing violations | None |
| db_session | 'Session' \| None | Optional SQLAlchemy database session for persisting metrics and violations | None |

Source code in src/pycharter/quality/check.py
def __init__(
    self,
    store: MetadataStoreClient | None = None,
    db_session: "Session" | None = None,
):
    """
    Initialize quality check.

    Args:
        store: Optional metadata store for retrieving contracts and storing violations
        db_session: Optional SQLAlchemy database session for persisting metrics and violations
    """
    self.store = store
    self.db_session = db_session
    self.metrics = QualityMetrics()
    self.violation_tracker = ViolationTracker(store=store, db_session=db_session)
    self.profiler = DataProfiler()

run

run(
    contract_name: str | None = None,
    contract_version: str | None = None,
    contract: dict[str, Any] | str | None = None,
    data: (
        list[dict[str, Any]]
        | str
        | Callable[[], Any]
        | None
    ) = None,
    options: QualityCheckOptions | None = None,
) -> QualityReport

Run a quality check against a data contract. Use (contract_name, contract_version) for store-based validation, or contract for in-memory.

Source code in src/pycharter/quality/check.py
def run(
    self,
    contract_name: str | None = None,
    contract_version: str | None = None,
    contract: dict[str, Any] | str | None = None,
    data: list[dict[str, Any]] | str | Callable[[], Any] | None = None,
    options: QualityCheckOptions | None = None,
) -> QualityReport:
    """
    Run a quality check against a data contract.
    Use (contract_name, contract_version) for store-based validation, or contract for in-memory.
    """
    if options is None:
        options = QualityCheckOptions()

    schema_id = (
        f"{contract_name}:{contract_version}"
        if (contract_name and contract_version)
        else None
    )

    data_list = self._load_data(data)
    data_fingerprint = self._calculate_data_fingerprint(data_list)
    data_source = options.data_source or self._get_data_source(data)

    if options.sample_size and options.sample_size < len(data_list):
        import random

        data_list = random.sample(data_list, options.sample_size)

    if options.skip_if_unchanged and self.db_session and data_fingerprint:
        existing_metric = self._get_existing_metric(
            schema_id=schema_id,
            data_fingerprint=data_fingerprint,
            data_version=options.data_version,
        )
        if existing_metric:
            pass

    profile_data = None
    if options.include_profiling:
        profile_data = self.profiler.profile(data_list)

    validation_results = self._validate_data(
        contract_name=contract_name,
        contract_version=contract_version,
        contract=contract,
        data_list=data_list,
    )

    # Calculate metrics
    quality_score = None
    field_metrics = {}
    if options.calculate_metrics:
        quality_score = self.metrics.calculate_quality_score(validation_results)
        if options.include_field_metrics:
            field_metrics = self.metrics.calculate_field_metrics(validation_results)

    violation_count = 0
    if options.record_violations:
        violation_count = self._record_violations(
            schema_id=schema_id,
            contract=contract,
            data_list=data_list,
            validation_results=validation_results,
            options=options,
        )

    # Check thresholds
    threshold_breaches = []
    if options.check_thresholds and options.thresholds and quality_score:
        threshold_breaches = options.thresholds.check(quality_score)

    # Build report
    valid_count = sum(1 for r in validation_results if r.is_valid)
    invalid_count = len(validation_results) - valid_count

    schema_version = None
    if contract_name and contract_version and self.store:
        try:
            full_schema = self.store.get_complete_schema(
                contract_name, contract_version
            )
            schema_version = (
                full_schema.get("version") if full_schema else contract_version
            )
        except Exception:
            pass

    report_metadata = {
        "sample_size": options.sample_size,
        "contract_based": contract is not None,
        "data_fingerprint": data_fingerprint,
        "data_source": data_source,
    }
    if options.data_version:
        report_metadata["data_version"] = options.data_version
    if profile_data:
        report_metadata["profiling"] = profile_data
    if options.metadata:
        report_metadata.update(options.metadata)

    report = QualityReport(
        schema_id=schema_id or "unknown",
        schema_version=schema_version,
        quality_score=quality_score,
        field_metrics=field_metrics,
        violation_count=violation_count,
        record_count=len(data_list),
        valid_count=valid_count,
        invalid_count=invalid_count,
        threshold_breaches=threshold_breaches,
        metadata=report_metadata,
    )

    # Persist quality metrics to database if session is available
    if self.db_session and quality_score:
        self._persist_quality_metrics(
            report,
            schema_id,
            schema_version,
            data_fingerprint=data_fingerprint,
            data_version=options.data_version,
            data_source=data_source,
            skip_if_unchanged=options.skip_if_unchanged,
        )

    return report
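
In the source above, `run` computes a data fingerprint before sampling, so the fingerprint reflects the full dataset, and then uses it for the skip-if-unchanged lookup and in the report metadata. The body of `_calculate_data_fingerprint` is not shown on this page. The sketch below is one plausible approach (SHA-256 over a canonical JSON encoding), offered under that assumption; `fingerprint`, `should_skip`, and `seen` are hypothetical names.

```python
import hashlib
import json
from typing import Any


def fingerprint(records: list[dict[str, Any]]) -> str:
    """Hash a canonical JSON encoding so equal data yields an equal digest."""
    canonical = json.dumps(records, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


seen: set[str] = set()  # stand-in for fingerprints already stored in a database


def should_skip(records: list[dict[str, Any]]) -> bool:
    """Return True when this exact dataset was already checked."""
    digest = fingerprint(records)
    if digest in seen:
        return True
    seen.add(digest)
    return False


batch = [{"name": "Alice", "age": 30}]
print(should_skip(batch))                           # False: first time seen
print(should_skip([{"age": 30, "name": "Alice"}]))  # True: same data, different key order
```

Sorting keys before hashing makes the digest independent of dict key order, while record order still matters in this sketch.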

run_by_state

run_by_state(
    contract_name: str | None = None,
    contract_version: str | None = None,
    contract: dict[str, Any] | str | None = None,
    data: (
        list[dict[str, Any]]
        | str
        | Callable[[], Any]
        | None
    ) = None,
    state_field: str = "status",
    options: QualityCheckOptions | None = None,
) -> dict[str, QualityReport]

Run quality check segmented by state value.

Groups the data by the value of state_field, runs a separate quality check for each group, and returns a mapping from state value to :class:QualityReport.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| contract_name | str \| None | Contract name for store-based validation. | None |
| contract_version | str \| None | Contract version for store-based validation. | None |
| contract | dict[str, Any] \| str \| None | In-memory contract dict or file path. | None |
| data | list[dict[str, Any]] \| str \| Callable[[], Any] \| None | Data source (list, file path, or callable). | None |
| state_field | str | Field name to group records by. | 'status' |
| options | QualityCheckOptions \| None | Optional quality check options (applied to each group). | None |

Returns:

| Type | Description |
| --- | --- |
| dict[str, QualityReport] | Dict mapping each unique state value to its QualityReport. |

Example

qc = QualityCheck()
reports = qc.run_by_state(
    contract=contract_dict,
    data=[{"status": "NEW", "x": 1}, {"status": "ACTIVE", "x": 2}],
    state_field="status",
)
print(reports.keys())  # dict_keys(['NEW', 'ACTIVE'])

Source code in src/pycharter/quality/check.py
def run_by_state(
    self,
    contract_name: str | None = None,
    contract_version: str | None = None,
    contract: dict[str, Any] | str | None = None,
    data: list[dict[str, Any]] | str | Callable[[], Any] | None = None,
    state_field: str = "status",
    options: QualityCheckOptions | None = None,
) -> dict[str, QualityReport]:
    """Run quality check segmented by state value.

    Groups the data by the value of *state_field*, runs a separate
    quality check for each group, and returns a mapping from state
    value to :class:`QualityReport`.

    Args:
        contract_name: Contract name for store-based validation.
        contract_version: Contract version for store-based validation.
        contract: In-memory contract dict or file path.
        data: Data source (list, file path, or callable).
        state_field: Field name to group records by (default ``"status"``).
        options: Optional quality check options (applied to each group).

    Returns:
        Dict mapping each unique state value to its ``QualityReport``.

    Example:
        >>> qc = QualityCheck()
        >>> reports = qc.run_by_state(
        ...     contract=contract_dict,
        ...     data=[{"status": "NEW", "x": 1}, {"status": "ACTIVE", "x": 2}],
        ...     state_field="status",
        ... )
        >>> print(reports.keys())  # dict_keys(['NEW', 'ACTIVE'])
    """
    data_list = self._load_data(data)

    # Group records by state value
    grouped: dict[str, list[dict[str, Any]]] = {}
    for record in data_list:
        state = record.get(state_field, "_unknown")
        state_str = str(state) if state is not None else "_unknown"
        grouped.setdefault(state_str, []).append(record)

    # Run quality check for each group
    results: dict[str, QualityReport] = {}
    for state_value, records in grouped.items():
        results[state_value] = self.run(
            contract_name=contract_name,
            contract_version=contract_version,
            contract=contract,
            data=records,
            options=options,
        )
    return results
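
The grouping step in the source above has two details worth noting: records whose state is missing or `None` land in an `"_unknown"` bucket, and non-string states are converted with `str()`, so the report keys are always strings. The standalone sketch below reproduces just that grouping logic; `group_by_state` is a hypothetical name used for illustration.

```python
from typing import Any


def group_by_state(
    records: list[dict[str, Any]], state_field: str = "status"
) -> dict[str, list[dict[str, Any]]]:
    """Bucket records by the string form of state_field."""
    grouped: dict[str, list[dict[str, Any]]] = {}
    for record in records:
        state = record.get(state_field, "_unknown")
        key = str(state) if state is not None else "_unknown"
        grouped.setdefault(key, []).append(record)
    return grouped


groups = group_by_state(
    [
        {"status": "NEW", "x": 1},
        {"status": None, "x": 2},  # explicit null
        {"x": 3},                  # field missing
        {"status": 1, "x": 4},     # non-string state becomes "1"
    ]
)
print(list(groups))              # ['NEW', '_unknown', '1']
print(len(groups["_unknown"]))   # 2
```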

QualityThresholds

Bases: BaseModel

Quality thresholds for alerting.

| Attribute | Type | Default |
| --- | --- | --- |
| min_overall_score | float | 95.0 |
| max_violation_rate | float | 0.05 |
| min_completeness | float | 0.95 |
| min_accuracy | float | 0.95 |
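
The `run` source on this page calls `options.thresholds.check(quality_score)` and stores the result as `threshold_breaches`. The body of `check` is not shown, so the following is only a sketch of how such a comparison could work. `Score` and `Thresholds` are stand-in dataclasses that reuse the documented field names and defaults; the strict comparisons and the message wording are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Score:
    """Stand-in for QualityScore, with fields listed on this page."""
    overall_score: float   # 0-100
    violation_rate: float  # 0-1
    completeness: float    # 0-1
    accuracy: float        # 0-1


@dataclass
class Thresholds:
    """Stand-in for QualityThresholds, using the documented defaults."""
    min_overall_score: float = 95.0
    max_violation_rate: float = 0.05
    min_completeness: float = 0.95
    min_accuracy: float = 0.95

    def check(self, score: Score) -> list[str]:
        """Return one message per breached threshold (hypothetical wording)."""
        breaches = []
        if score.overall_score < self.min_overall_score:
            breaches.append(f"overall_score {score.overall_score} < {self.min_overall_score}")
        if score.violation_rate > self.max_violation_rate:
            breaches.append(f"violation_rate {score.violation_rate} > {self.max_violation_rate}")
        if score.completeness < self.min_completeness:
            breaches.append(f"completeness {score.completeness} < {self.min_completeness}")
        if score.accuracy < self.min_accuracy:
            breaches.append(f"accuracy {score.accuracy} < {self.min_accuracy}")
        return breaches


breaches = Thresholds().check(Score(92.0, 0.08, 0.99, 0.97))
print(len(breaches))  # 2: overall_score and violation_rate
```

Note the mixed scales: `min_overall_score` is on a 0-100 scale, while the other three thresholds are ratios between 0 and 1.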

QualityCheckOptions

Bases: BaseModel

Options for quality checks.

basic classmethod

basic() -> QualityCheckOptions

Create options for quick one-off quality checks.

Enables metrics and violation recording. Disables profiling and threshold checking for speed.

Returns:

| Type | Description |
| --- | --- |
| QualityCheckOptions | QualityCheckOptions configured for basic checks. |

Source code in src/pycharter/quality/models.py
@classmethod
def basic(cls) -> QualityCheckOptions:
    """Create options for quick one-off quality checks.

    Enables metrics and violation recording. Disables profiling and
    threshold checking for speed.

    Returns:
        QualityCheckOptions configured for basic checks.
    """
    return cls(
        record_violations=True,
        calculate_metrics=True,
        check_thresholds=False,
        thresholds=None,
        include_field_metrics=True,
        include_profiling=False,
    )

strict classmethod

strict() -> QualityCheckOptions

Create options for gated quality checks.

Enables all features including profiling and threshold checking with default thresholds. Use this when quality must meet minimum standards before proceeding.

Returns:

| Type | Description |
| --- | --- |
| QualityCheckOptions | QualityCheckOptions configured for strict checks. |

Source code in src/pycharter/quality/models.py
@classmethod
def strict(cls) -> QualityCheckOptions:
    """Create options for gated quality checks.

    Enables all features including profiling and threshold checking
    with default thresholds. Use this when quality must meet minimum
    standards before proceeding.

    Returns:
        QualityCheckOptions configured for strict checks.
    """
    return cls(
        record_violations=True,
        calculate_metrics=True,
        check_thresholds=True,
        thresholds=QualityThresholds(),
        include_field_metrics=True,
        include_profiling=True,
    )

monitoring classmethod

monitoring() -> QualityCheckOptions

Create options for scheduled/recurring quality checks.

Enables all features plus deduplication and skip-if-unchanged to avoid redundant work in monitoring pipelines.

Returns:

| Type | Description |
| --- | --- |
| QualityCheckOptions | QualityCheckOptions configured for monitoring. |

Source code in src/pycharter/quality/models.py
@classmethod
def monitoring(cls) -> QualityCheckOptions:
    """Create options for scheduled/recurring quality checks.

    Enables all features plus deduplication and skip-if-unchanged
    to avoid redundant work in monitoring pipelines.

    Returns:
        QualityCheckOptions configured for monitoring.
    """
    return cls(
        record_violations=True,
        calculate_metrics=True,
        check_thresholds=True,
        thresholds=QualityThresholds(),
        include_field_metrics=True,
        include_profiling=True,
        skip_if_unchanged=True,
        deduplicate_violations=True,
    )

QualityReport

The report returned by QualityCheck.run() and the convenience functions:

| Attribute | Type | Description |
| --- | --- | --- |
| schema_id | str | Schema identifier |
| check_timestamp | str | ISO timestamp |
| quality_score | QualityScore | Quality metrics |
| field_metrics | dict | Per-field metrics |
| record_count | int | Total records |
| valid_count | int | Valid records |
| invalid_count | int | Invalid records |
| violation_count | int | Total violations |
| threshold_breaches | list[str] | Breached thresholds |
| passed | bool | All thresholds passed |

QualityScore

| Attribute | Type | Description |
| --- | --- | --- |
| overall_score | float | 0-100 quality score |
| violation_rate | float | 0-1 violation ratio |
| completeness | float | 0-1 completeness ratio |
| accuracy | float | 0-1 accuracy ratio |
| field_scores | dict[str, float] | Per-field scores |

Examples

One-Liner with Strict Thresholds

from pycharter import check_quality, QualityCheckOptions

report = check_quality(
    contract="contracts/user.yaml",
    data="data/users.json",
    options=QualityCheckOptions.strict(),
)

if not report.passed:
    print(f"Breaches: {report.threshold_breaches}")

Store-Based with Custom Options

from pycharter import QualityCheck, QualityCheckOptions, QualityThresholds

check = QualityCheck(store=store)
report = check.run(
    # run() looks up store contracts by name and version, not by schema_id.
    contract_name="user",
    contract_version="1.0.0",
    data=records,
    options=QualityCheckOptions(
        calculate_metrics=True,
        record_violations=True,
        check_thresholds=True,
        thresholds=QualityThresholds(min_overall_score=95.0),
        include_field_metrics=True,
        sample_size=1000,
    ),
)

Quality Gate in a Pipeline

from pycharter import check_quality, QualityCheckOptions

report = check_quality(contract="contracts/orders.yaml", data=loaded_records,
                       options=QualityCheckOptions.strict())

if not report.passed:
    raise RuntimeError(f"Quality gate failed: {report.threshold_breaches}")
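
If several pipelines share the same gate, the check can be wrapped in a small helper that raises a dedicated exception. The sketch below is one way to do that. `QualityGateError` and `enforce_quality_gate` are hypothetical names, not part of PyCharter, and a `SimpleNamespace` stands in for a `QualityReport` so the example runs without the library.

```python
from types import SimpleNamespace


class QualityGateError(RuntimeError):
    """Hypothetical exception carrying the breached thresholds."""

    def __init__(self, breaches: list[str]):
        super().__init__(f"Quality gate failed: {breaches}")
        self.breaches = breaches


def enforce_quality_gate(report) -> None:
    """Raise if the report did not pass; otherwise return silently."""
    if not report.passed:
        raise QualityGateError(list(report.threshold_breaches))


# Stand-in for a QualityReport; only the two attributes used above are modelled.
failing = SimpleNamespace(passed=False, threshold_breaches=["overall_score below 95.0"])

try:
    enforce_quality_gate(failing)
except QualityGateError as exc:
    caught = exc.breaches  # callers can inspect the breaches programmatically
```

Subclassing `RuntimeError` keeps the helper compatible with code that already catches the plain `RuntimeError` raised in the example above.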

See Also