Data Profiling
Profile datasets to understand their shape, completeness, and distributions. Use profiling to explore new data sources, detect drift between pipeline runs, or produce a quality summary alongside contract validation.
Quick start
The simplest way to profile data is the profile_data() function:
from pycharter import profile_data

data = [
    {"user_id": 1, "age": 28, "email": "alice@example.com", "score": 0.85},
    {"user_id": 2, "age": None, "email": "bob@example.com", "score": 0.62},
    {"user_id": 3, "age": 35, "email": None, "score": 0.91},
]

report = profile_data(data)
print(report["record_count"])  # 3
print(report["field_profiles"]["age"]["null_percentage"])  # 33.33
print(report["field_profiles"]["score"]["mean"])  # 0.793
print(report["overall_stats"])
You can also restrict profiling to specific fields, as described under Profiling a subset of fields.
Using the DataProfiler class
For more control (e.g. reusing the profiler across calls), use the class directly:
from pycharter.quality import DataProfiler
profiler = DataProfiler()
report = profiler.profile(data)
Profile structure
DataProfiler.profile() returns a dict with three top-level keys:
{
    "record_count": int,
    "field_profiles": {
        "<field_name>": { ... per-field stats ... },
        ...
    },
    "overall_stats": { ... dataset-level stats ... },
}
Per-field stats
Every field profile contains:
| Key | Type | Description |
|---|---|---|
| field_name | str | Field name |
| null_count | int | Number of null/None values |
| non_null_count | int | Number of non-null values |
| null_percentage | float | null_count / record_count * 100 |
| distinct_count | int | Number of unique values |
| distinct_percentage | float | distinct_count / record_count * 100 |
| data_type | str | Inferred Python type (int, float, str, bool, mixed, null) |
| sample_values | list | Up to 5 representative values |
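The count-based formulas in the table can be checked by hand. The sketch below recomputes them for a single field using only the standard library. `field_counts` is a hypothetical helper, not part of pycharter, and it assumes that a missing key counts as null, that nulls are excluded from the distinct count, and that values are hashable.

```python
from typing import Any


def field_counts(records: list[dict[str, Any]], field: str) -> dict[str, Any]:
    """Recompute the count-based stats for one field (illustrative only)."""
    values = [record.get(field) for record in records]
    non_null = [v for v in values if v is not None]
    total = len(values)
    null_count = total - len(non_null)
    # Assumption: nulls are not counted as a distinct value.
    distinct = len(set(non_null))
    return {
        "null_count": null_count,
        "non_null_count": len(non_null),
        "null_percentage": null_count / total * 100 if total else 0.0,
        "distinct_count": distinct,
        "distinct_percentage": distinct / total * 100 if total else 0.0,
    }


data = [
    {"user_id": 1, "age": 28},
    {"user_id": 2, "age": None},
    {"user_id": 3, "age": 35},
]
age_counts = field_counts(data, "age")
print(round(age_counts["null_percentage"], 2))  # 33.33
```

With one null in three records, the null percentage comes out at 33.33, matching the quick-start output.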
For numeric fields, the profile also includes:
| Key | Description |
|---|---|
| min | Minimum value |
| max | Maximum value |
| mean | Arithmetic mean |
| median | Median |
| stdev | Standard deviation (or None if fewer than 2 values) |
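To see where the None fallback for stdev comes from, here is a minimal sketch built on Python's `statistics` module. `numeric_stats` is a hypothetical helper, and the choice of sample standard deviation (`statistics.stdev`) is an assumption; the text does not say whether pycharter uses the sample or the population form.

```python
import statistics
from typing import Optional


def numeric_stats(values: list[Optional[float]]) -> dict[str, Optional[float]]:
    """Summarise the non-null numbers in a column (illustrative only)."""
    numbers = [v for v in values if v is not None]
    if not numbers:
        return {"min": None, "max": None, "mean": None, "median": None, "stdev": None}
    return {
        "min": min(numbers),
        "max": max(numbers),
        "mean": statistics.mean(numbers),
        "median": statistics.median(numbers),
        # statistics.stdev raises StatisticsError for a single value,
        # so fall back to None when there are fewer than 2 values.
        "stdev": statistics.stdev(numbers) if len(numbers) >= 2 else None,
    }


scores = numeric_stats([0.85, 0.62, 0.91])
ages = numeric_stats([28, None, 35])
single = numeric_stats([42])
print(single["stdev"])  # None
```

Nulls are dropped before any statistic is computed, so the age column above is summarised from two values rather than three.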
For string fields, the profile also includes:
| Key | Description |
|---|---|
| min_length | Shortest string length |
| max_length | Longest string length |
| mean_length | Average string length |
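The length stats follow the same pattern. A short sketch, assuming nulls are skipped and lengths are measured with `len()` (characters, not bytes); `length_stats` is a hypothetical name:

```python
from typing import Optional


def length_stats(values: list[Optional[str]]) -> dict[str, Optional[float]]:
    """Compute string-length stats over the non-null values (illustrative only)."""
    lengths = [len(v) for v in values if v is not None]
    if not lengths:
        return {"min_length": None, "max_length": None, "mean_length": None}
    return {
        "min_length": min(lengths),
        "max_length": max(lengths),
        "mean_length": sum(lengths) / len(lengths),
    }


emails = length_stats(["alice@example.com", "bob@example.com", None])
print(emails)  # {'min_length': 15, 'max_length': 17, 'mean_length': 16.0}
```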
Overall stats
| Key | Description |
|---|---|
| total_fields | Number of fields profiled |
| fields_with_nulls | Number of fields that have at least one null |
| completeness | Overall data completeness as a percentage: (1 - avg_null_rate) * 100 |
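As a worked example of the completeness formula, the sketch below applies it to the quick-start records. It assumes avg_null_rate is the mean of the per-field null rates expressed as fractions between 0 and 1; `overall_stats` here is a hypothetical stand-in, not the library's implementation.

```python
from typing import Any


def overall_stats(records: list[dict[str, Any]]) -> dict[str, Any]:
    """Dataset-level stats for a non-empty list of records (illustrative only)."""
    if not records:
        raise ValueError("need at least one record")
    fields = sorted({key for record in records for key in record})
    # Per-field null rate as a fraction; a missing key is treated as null.
    null_rates = [
        sum(1 for record in records if record.get(field) is None) / len(records)
        for field in fields
    ]
    avg_null_rate = sum(null_rates) / len(null_rates)
    return {
        "total_fields": len(fields),
        "fields_with_nulls": sum(1 for rate in null_rates if rate > 0),
        "completeness": (1 - avg_null_rate) * 100,
    }


data = [
    {"user_id": 1, "age": 28, "email": "alice@example.com", "score": 0.85},
    {"user_id": 2, "age": None, "email": "bob@example.com", "score": 0.62},
    {"user_id": 3, "age": 35, "email": None, "score": 0.91},
]
stats = overall_stats(data)
print(f"{stats['completeness']:.2f}")  # 83.33
```

Two of the four fields are one-third null, so the average null rate is 1/6 and completeness is about 83.33%.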
Profiling a subset of fields
Pass fields to restrict the profile to specific columns. This is useful when profiling wide tables where you only care about a few key fields.
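As a rough illustration of what restricting the profile means, the sketch below trims each record to the chosen fields before it would be profiled. `select_fields` is a hypothetical helper written for this example. With pycharter itself you would presumably pass the list through the fields argument instead, for example `profile_data(data, fields=["age", "score"])`; that call shape is an assumption inferred from the text, not a confirmed signature.

```python
from typing import Any


def select_fields(records: list[dict[str, Any]], fields: list[str]) -> list[dict[str, Any]]:
    """Keep only the requested fields; a field absent from a record becomes None."""
    return [{name: record.get(name) for name in fields} for record in records]


data = [
    {"user_id": 1, "age": 28, "email": "alice@example.com", "score": 0.85},
    {"user_id": 2, "age": None, "email": "bob@example.com", "score": 0.62},
]
trimmed = select_fields(data, ["age", "score"])
print(trimmed)  # [{'age': 28, 'score': 0.85}, {'age': None, 'score': 0.62}]
```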
Integration with QualityCheck
Run a quality check and a profile side by side:
from pycharter import QualityCheck
from pycharter.quality import DataProfiler

# store, schema_id, and output_data are assumed to be defined by your pipeline setup.
check = QualityCheck(metadata_store=store)
profiler = DataProfiler()

# After the pipeline loads data:
quality_report = check.run(schema_id=schema_id, data=output_data)
profile = profiler.profile(output_data)

# Log key stats: flag any field with more than 10% nulls
for field, stats in profile["field_profiles"].items():
    if stats["null_percentage"] > 10:
        print(f"Warning: {field} has {stats['null_percentage']:.1f}% nulls")
print(f"Overall completeness: {profile['overall_stats']['completeness']:.1f}%")
Detecting drift between runs
Store profile results between runs and compare them:
import json
from pathlib import Path

from pycharter.quality import DataProfiler

PROFILE_FILE = Path("./.profiles/orders.json")


def run_with_drift_check(data):
    profiler = DataProfiler()
    current = profiler.profile(data)

    # Compare against the previous run's profile, if one was saved.
    if PROFILE_FILE.exists():
        previous = json.loads(PROFILE_FILE.read_text())
        for field, stats in current["field_profiles"].items():
            prev_stats = previous["field_profiles"].get(field, {})
            if prev_stats:
                null_delta = stats["null_percentage"] - prev_stats.get("null_percentage", 0)
                if abs(null_delta) > 5:
                    print(f"Drift: {field} null rate changed by {null_delta:+.1f}%")

    # Save the current profile as the baseline for the next run.
    PROFILE_FILE.parent.mkdir(parents=True, exist_ok=True)
    PROFILE_FILE.write_text(json.dumps(current, indent=2, default=str))
    return current
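The comparison step can also be pulled out into a pure function, which makes it easier to unit test and to extend beyond null rates. The sketch below is one possible shape under stated assumptions: `find_drift` and its thresholds are hypothetical, and it relies only on the profile layout described under Profile structure. Because the saved profile is written with `default=str`, values that are not JSON-serializable come back as strings, so the mean comparison checks types before doing arithmetic.

```python
from typing import Any


def find_drift(
    previous: dict[str, Any],
    current: dict[str, Any],
    null_threshold: float = 5.0,
    mean_threshold: float = 0.10,
) -> list[str]:
    """Return human-readable drift messages; an empty list means no drift."""
    messages = []
    prev_fields = previous.get("field_profiles", {})
    for field, stats in current.get("field_profiles", {}).items():
        prev = prev_fields.get(field)
        if prev is None:
            messages.append(f"{field}: new field, no baseline")
            continue
        # Null rate: compare in percentage points.
        null_delta = stats["null_percentage"] - prev.get("null_percentage", 0.0)
        if abs(null_delta) > null_threshold:
            messages.append(f"{field}: null rate changed by {null_delta:+.1f} points")
        # Mean: compare as a relative change, numeric fields only.
        old_mean, new_mean = prev.get("mean"), stats.get("mean")
        both_numeric = isinstance(old_mean, (int, float)) and isinstance(new_mean, (int, float))
        if both_numeric and old_mean:
            relative = (new_mean - old_mean) / abs(old_mean)
            if abs(relative) > mean_threshold:
                messages.append(f"{field}: mean shifted by {relative:+.0%}")
    return messages


previous = {"field_profiles": {"age": {"null_percentage": 2.0, "mean": 30.0}}}
current = {
    "field_profiles": {
        "age": {"null_percentage": 12.0, "mean": 36.0},
        "score": {"null_percentage": 0.0, "mean": 0.8},
    }
}
drift = find_drift(previous, current)
for message in drift:
    print(message)
```

Returning messages instead of printing them lets the caller decide whether drift should be logged, raised as an error, or sent to an alerting channel.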
Profiling inside a pipeline
Attach profiling to a pipeline's output, for example through the PipelineResult.metadata field or a post-load callback. The simplest approach is to profile result.output_data once the run has finished:
import asyncio

from pycharter import Pipeline
from pycharter.quality import DataProfiler


async def run_with_profile():
    pipeline = Pipeline.from_config_dir("pipelines/orders/")
    result = await pipeline.run()
    # Only profile when the run succeeded and produced data.
    if result.success and result.output_data:
        profiler = DataProfiler()
        profile = profiler.profile(result.output_data)
        completeness = profile["overall_stats"]["completeness"]
        print(f"Data completeness: {completeness:.1f}%")


asyncio.run(run_with_profile())
See also
- Data Quality Monitoring: threshold-based monitoring with QualityCheck
- Pipeline quality checks: column/dataset checks embedded in load config
- API Reference: QualityCheck