Data Profiling

Profile datasets to understand their shape, completeness, and distributions. Use profiling to explore new data sources, detect drift between pipeline runs, or produce a quality summary alongside contract validation.

Quick start

The simplest way to profile data is the profile_data() function:

from pycharter import profile_data

data = [
    {"user_id": 1, "age": 28,  "email": "alice@example.com", "score": 0.85},
    {"user_id": 2, "age": None, "email": "bob@example.com",  "score": 0.62},
    {"user_id": 3, "age": 35,  "email": None,                "score": 0.91},
]

report = profile_data(data)

print(report["record_count"])   # 3
print(report["field_profiles"]["age"]["null_percentage"])   # 33.33
print(report["field_profiles"]["score"]["mean"])            # 0.793
print(report["overall_stats"])

You can also restrict profiling to specific fields:

report = profile_data(data, fields=["age", "score"])

Using the DataProfiler class

For more control (e.g. reusing the profiler across calls), use the class directly:

from pycharter.quality import DataProfiler

profiler = DataProfiler()
report = profiler.profile(data)

Profile structure

DataProfiler.profile() returns a dict with three top-level keys:

{
    "record_count": int,
    "field_profiles": {
        "<field_name>": { ... per-field stats ... },
        ...
    },
    "overall_stats": { ... dataset-level stats ... },
}

Per-field stats

Every field profile contains:

| Key | Type | Description |
| --- | --- | --- |
| field_name | str | Field name |
| null_count | int | Number of null/None values |
| non_null_count | int | Number of non-null values |
| null_percentage | float | null_count / record_count * 100 |
| distinct_count | int | Number of unique values |
| distinct_percentage | float | distinct_count / record_count * 100 |
| data_type | str | Inferred Python type (int, float, str, bool, mixed, null) |
| sample_values | list | Up to 5 representative values |
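
To make the count and percentage definitions above concrete, here is a minimal pure-Python sketch that derives them for a single field. It is not pycharter's implementation: `basic_field_stats` is a hypothetical helper, and it assumes that a missing key counts as null and that distinct values are counted over non-null values only.

```python
from typing import Any


def basic_field_stats(records: list[dict[str, Any]], field: str) -> dict[str, Any]:
    """Hypothetical helper: count-based stats for one field (not the pycharter API)."""
    record_count = len(records)
    # Assumption: a missing key is treated the same as an explicit None.
    values = [record.get(field) for record in records]
    non_null = [value for value in values if value is not None]
    null_count = record_count - len(non_null)
    # Assumption: distinct values are counted over non-null values only.
    # Values must be hashable for set() to work.
    distinct_count = len(set(non_null))

    def pct(part: int) -> float:
        # Guard against an empty dataset to avoid dividing by zero.
        return part / record_count * 100 if record_count else 0.0

    return {
        "field_name": field,
        "null_count": null_count,
        "non_null_count": len(non_null),
        "null_percentage": pct(null_count),
        "distinct_count": distinct_count,
        "distinct_percentage": pct(distinct_count),
    }


rows = [
    {"user_id": 1, "age": 28},
    {"user_id": 2, "age": None},
    {"user_id": 3, "age": 35},
]
stats = basic_field_stats(rows, "age")
print(stats["null_count"], round(stats["null_percentage"], 2))  # 1 33.33
```

Whether the library counts None as a distinct value is not stated here, so treat the `distinct_count` behavior in this sketch as one possible reading.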

For numeric fields, the profile also includes:

| Key | Description |
| --- | --- |
| min | Minimum value |
| max | Maximum value |
| mean | Arithmetic mean |
| median | Median |
| stdev | Standard deviation (or None if fewer than 2 values) |
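
The numeric keys above map closely onto Python's standard `statistics` module. The sketch below shows one way to compute them, including the None fallback for `stdev`. `numeric_stats` is a hypothetical helper, and the use of the sample standard deviation (`statistics.stdev`) is an assumption suggested by the "fewer than 2 values" rule, not something this page confirms.

```python
import statistics
from typing import Optional


def numeric_stats(values: list[Optional[float]]) -> dict[str, Optional[float]]:
    """Hypothetical helper: summary stats over the non-null numeric values."""
    numbers = [v for v in values if v is not None]
    if not numbers:
        return {"min": None, "max": None, "mean": None, "median": None, "stdev": None}
    return {
        "min": min(numbers),
        "max": max(numbers),
        "mean": statistics.mean(numbers),
        "median": statistics.median(numbers),
        # statistics.stdev raises StatisticsError for fewer than 2 values,
        # so report None in that case.
        "stdev": statistics.stdev(numbers) if len(numbers) >= 2 else None,
    }


scores = numeric_stats([0.85, 0.62, 0.91])
single = numeric_stats([42])
print(round(scores["mean"], 3), scores["median"], single["stdev"])  # 0.793 0.85 None
```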

For string fields, the profile also includes:

| Key | Description |
| --- | --- |
| min_length | Shortest string length |
| max_length | Longest string length |
| mean_length | Average string length |

Overall stats

| Key | Description |
| --- | --- |
| total_fields | Number of fields profiled |
| fields_with_nulls | Number of fields that have at least one null |
| completeness | Overall data completeness as a percentage: (1 - avg_null_rate) * 100 |
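
As a worked example of the completeness formula, the sketch below applies it to the quick-start dataset, where age and email each have one null in three records. `completeness` here is a hypothetical helper, and it assumes `avg_null_rate` is the unweighted mean of the per-field null fractions.

```python
def completeness(null_percentages: dict[str, float]) -> float:
    """Hypothetical helper: (1 - avg_null_rate) * 100 from per-field null percentages."""
    if not null_percentages:
        return 100.0  # Assumption: an empty profile is treated as fully complete.
    # Convert percentages to fractions, then average across fields.
    avg_null_rate = sum(p / 100 for p in null_percentages.values()) / len(null_percentages)
    return (1 - avg_null_rate) * 100


# Null percentages for the quick-start dataset.
quick_start = {"user_id": 0.0, "age": 100 / 3, "email": 100 / 3, "score": 0.0}
print(f"{completeness(quick_start):.2f}")  # 83.33
```

The average null rate is (0 + 1/3 + 1/3 + 0) / 4 = 1/6, so completeness is (1 - 1/6) * 100, or about 83.33.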

Profiling a subset of fields

Pass fields to restrict the profile to specific columns:

report = profiler.profile(data, fields=["age", "score"])

This is useful for wide tables where only a few key fields matter.


Integration with QualityCheck

Run a quality check and a profile side by side:

from pycharter import QualityCheck
from pycharter.quality import DataProfiler

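# Assumes `store`, `schema_id`, and `output_data` are already defined:
# a configured metadata store, the ID of the schema to validate against,
# and the records produced by the pipeline.
check = QualityCheck(metadata_store=store)
profiler = DataProfiler()

# After the pipeline loads data:
quality_report = check.run(schema_id=schema_id, data=output_data)
profile = profiler.profile(output_data)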

# Log key stats
for field, stats in profile["field_profiles"].items():
    if stats["null_percentage"] > 10:
        print(f"Warning: {field} has {stats['null_percentage']:.1f}% nulls")

print(f"Overall completeness: {profile['overall_stats']['completeness']:.1f}%")

Detecting drift between runs

Store profile results between runs and compare them:

import json
from pathlib import Path
from pycharter.quality import DataProfiler

PROFILE_FILE = Path("./.profiles/orders.json")

def run_with_drift_check(data):
    profiler = DataProfiler()
    current = profiler.profile(data)

    # Compare against the previous run's profile, if one was saved.
    if PROFILE_FILE.exists():
        previous = json.loads(PROFILE_FILE.read_text())
        for field, stats in current["field_profiles"].items():
            prev_stats = previous["field_profiles"].get(field, {})
            if prev_stats:
                # Change in null rate, measured in percentage points.
                null_delta = stats["null_percentage"] - prev_stats.get("null_percentage", 0)
                if abs(null_delta) > 5:
                    print(f"Drift: {field} null rate changed by {null_delta:+.1f} percentage points")

    # Save the current profile as the baseline for the next run.
    PROFILE_FILE.parent.mkdir(parents=True, exist_ok=True)
    PROFILE_FILE.write_text(json.dumps(current, indent=2, default=str))
    return current
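
The comparison step can also be pulled out into a pure function, which makes it easy to unit test without touching the filesystem. The sketch below is one possible shape, not part of pycharter: `compare_profiles` and its `null_threshold` parameter are hypothetical, and it relies only on the `field_profiles` and `null_percentage` keys described earlier. It also reports fields that appear or disappear between runs.

```python
from typing import Any


def compare_profiles(
    previous: dict[str, Any],
    current: dict[str, Any],
    null_threshold: float = 5.0,
) -> list[str]:
    """Hypothetical helper: describe null-rate drift and field-set changes."""
    prev_fields = previous.get("field_profiles", {})
    curr_fields = current.get("field_profiles", {})
    findings = []

    # Fields that appeared or disappeared between runs.
    for field in sorted(curr_fields.keys() - prev_fields.keys()):
        findings.append(f"New field: {field}")
    for field in sorted(prev_fields.keys() - curr_fields.keys()):
        findings.append(f"Missing field: {field}")

    # For fields present in both runs, compare null rates in percentage points.
    for field in sorted(curr_fields.keys() & prev_fields.keys()):
        delta = curr_fields[field]["null_percentage"] - prev_fields[field]["null_percentage"]
        if abs(delta) > null_threshold:
            findings.append(f"Drift: {field} null rate changed by {delta:+.1f} points")
    return findings


# Illustrative profiles containing only the keys the helper reads.
previous = {"field_profiles": {"age": {"null_percentage": 2.0}, "email": {"null_percentage": 1.0}}}
current = {"field_profiles": {"age": {"null_percentage": 12.5}, "score": {"null_percentage": 0.0}}}
for finding in compare_profiles(previous, current):
    print(finding)
```

Returning a list of findings rather than printing inside the loop lets the caller decide whether to log them, fail the run, or send an alert.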

Profiling inside a pipeline

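Attach profiling to a pipeline's output using the PipelineResult.metadata field or a post-load callback. The example below takes the simplest route and profiles result.output_data directly after the run completes: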

import asyncio
from pycharter import Pipeline
from pycharter.quality import DataProfiler

async def run_with_profile():
    pipeline = Pipeline.from_config_dir("pipelines/orders/")
    result = await pipeline.run()

    if result.success and result.output_data:
        profiler = DataProfiler()
        profile = profiler.profile(result.output_data)
        completeness = profile["overall_stats"]["completeness"]
        print(f"Data completeness: {completeness:.1f}%")

asyncio.run(run_with_profile())
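
Printing completeness is a start, but a pipeline often needs to stop when quality drops too far. The sketch below shows one way to turn the profile into a gate. `CompletenessError` and `require_completeness` are hypothetical names, not part of pycharter, and the helper reads only the `overall_stats["completeness"]` key described earlier.

```python
from typing import Any


class CompletenessError(ValueError):
    """Hypothetical error for profiles below the required completeness."""


def require_completeness(profile: dict[str, Any], minimum: float = 95.0) -> float:
    """Return the profile's completeness, or raise if it is below `minimum`."""
    completeness = profile["overall_stats"]["completeness"]
    if completeness < minimum:
        raise CompletenessError(
            f"Data completeness {completeness:.1f}% is below the required {minimum:.1f}%"
        )
    return completeness


# Stand-in for the dict returned by profiler.profile(result.output_data).
sample_profile = {"overall_stats": {"completeness": 83.3}}
try:
    require_completeness(sample_profile, minimum=90.0)
except CompletenessError as exc:
    print(f"Pipeline check failed: {exc}")
```

Inside run_with_profile(), the call would replace the print statement, so a failed check surfaces as an exception from the pipeline run.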

See also