Core Concepts¶
Understanding PyCharter's key concepts will help you use it effectively.
Data Contracts¶
A data contract is a formal specification that defines:
- Schema: The structure and types of your data (JSON Schema)
- Coercion Rules: Pre-validation transformations (e.g., string → integer)
- Validation Rules: Post-validation business constraints
- Metadata: Description, ownership, governance information
- Ontology (optional): Semantic field annotations that record what each field means in the business domain (concepts, definitions, relationships)
Contract Structure¶
# user_contract.yaml
schema:
  type: object
  version: "1.0.0"
  properties:
    name:
      type: string
      minLength: 1
    email:
      type: string
      format: email
    age:
      type: integer
      coercion: coerce_to_integer  # Pre-validation coercion
      validations:
        is_positive: {}  # Post-validation check
  required:
    - name
    - email
metadata:
  title: User Contract
  description: Defines the structure of user records
  version: "1.0.0"
ownership:
  owner: data-team
  steward: alice@example.com
# Optional: ontology (semantic field annotations; see Wiki / Concepts)
# ontology:
#   version: "1.0.0"
#   fields:
#     email: { concept: user_email, definition: "Primary email address" }
When you use build_contract() or load a contract from the store, the contract dictionary contains a raw schema plus coercion_rules and validation_rules as separate keys (rules are not merged into the schema). The Validator merges rules internally when validating.
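To make the separation concrete, here is a minimal sketch of what such a merge could look like. It is not PyCharter's internal code: the helper `merge_rules` is hypothetical, and the shapes of `coercion_rules` (field name to coercion name) and `validation_rules` (field name to a dict of validations) are assumptions chosen to match the contract example above.

```python
import copy


def merge_rules(contract):
    """Return a copy of the schema with per-field rules attached (illustrative only)."""
    schema = copy.deepcopy(contract["schema"])
    properties = schema.setdefault("properties", {})
    # Assumed shape: {"field": "coercion_name"}
    for field, coercion in contract.get("coercion_rules", {}).items():
        if field in properties:
            properties[field]["coercion"] = coercion
    # Assumed shape: {"field": {"validation_name": {config}}}
    for field, validations in contract.get("validation_rules", {}).items():
        if field in properties:
            properties[field]["validations"] = copy.deepcopy(validations)
    return schema


contract = {
    "schema": {"type": "object", "properties": {"age": {"type": "integer"}}},
    "coercion_rules": {"age": "coerce_to_integer"},
    "validation_rules": {"age": {"is_positive": {}}},
}
merged = merge_rules(contract)
```

The original `contract["schema"]` is left untouched, which mirrors the idea that the stored schema stays raw while the merged view exists only at validation time.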
Contract Lifecycle¶
graph LR
A[Define Contract] --> B[Parse]
B --> C[Store in Registry]
C --> D[Generate Validators]
D --> E[Validate Data]
E --> F[Monitor Quality]
F --> A
Validation Pipeline¶
When data is validated, it goes through three stages:
1. Coercion (Pre-validation)¶
Transforms input data before validation, for example turning the string "30" into the integer 30 so that a type: integer check can pass.
Built-in coercions:
| Coercion | Description |
|---|---|
| coerce_to_string | Convert to string |
| coerce_to_integer | Convert to integer |
| coerce_to_float | Convert to float |
| coerce_to_boolean | Convert to boolean |
| coerce_to_datetime | Parse ISO datetime |
| coerce_to_date | Parse date only |
| coerce_to_lowercase | Lowercase string |
| coerce_to_uppercase | Uppercase string |
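The coercion stage can be pictured as a lookup from rule name to a conversion function. The sketch below uses plain Python stand-ins for a subset of the names in the table; the `COERCIONS` registry and `apply_coercions` helper are hypothetical, and the real built-ins may handle edge cases (nulls, malformed input) differently.

```python
# Stand-in implementations for a subset of the built-in coercion names.
COERCIONS = {
    "coerce_to_string": str,
    "coerce_to_integer": int,
    "coerce_to_float": float,
    "coerce_to_lowercase": lambda value: str(value).lower(),
    "coerce_to_uppercase": lambda value: str(value).upper(),
}


def apply_coercions(record, rules):
    """Return a new record with each listed field passed through its coercion."""
    coerced = dict(record)
    for field, name in rules.items():
        # Missing and null fields are left alone in this sketch.
        if coerced.get(field) is not None:
            coerced[field] = COERCIONS[name](coerced[field])
    return coerced


raw = {"name": "Alice", "email": "ALICE@Example.com", "age": "30"}
clean = apply_coercions(
    raw, {"age": "coerce_to_integer", "email": "coerce_to_lowercase"}
)
```

After this step `clean["age"]` is an integer, so the schema stage sees the type the contract declares.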
2. Schema Validation¶
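Validates the coerced data against the contract's JSON Schema (Draft 2020-12), checking types, required fields, and keyword constraints such as minLength and format.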
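As a rough illustration of what this stage checks, the sketch below hand-rolls three keywords (`type`, `required`, `minLength`). It is a toy subset written for this page, not a JSON Schema implementation and not how PyCharter validates; the function name `check_record` is hypothetical.

```python
TYPE_MAP = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "object": dict,
    "array": list,
}


def check_record(schema, record):
    """Return a list of error strings for a tiny subset of JSON Schema keywords."""
    errors = []
    for field in schema.get("required", []):
        if field not in record:
            errors.append(f"{field}: required field is missing")
    for field, spec in schema.get("properties", {}).items():
        if field not in record:
            continue
        value = record[field]
        expected = spec.get("type")
        if expected is not None:
            ok = isinstance(value, TYPE_MAP[expected])
            # bool is a subclass of int in Python, but not a JSON integer/number.
            if expected in ("integer", "number") and isinstance(value, bool):
                ok = False
            if not ok:
                errors.append(f"{field}: expected {expected}")
                continue
        min_length = spec.get("minLength")
        if min_length is not None and isinstance(value, str) and len(value) < min_length:
            errors.append(f"{field}: shorter than minLength {min_length}")
    return errors


schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string", "minLength": 1},
        "email": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["name", "email"],
}
errors = check_record(schema, {"name": "", "age": "x"})
```

Here `errors` reports the missing `email`, the empty `name`, and the non-integer `age`.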
3. Custom Validation (Post-validation)¶
Applies business rules after schema validation, so each rule can assume the value already has the declared type.
Built-in validations:
| Validation | Description | Config |
|---|---|---|
| min_length | Minimum string/array length | {"threshold": N} |
| max_length | Maximum string/array length | {"threshold": N} |
| is_positive | Value > 0 | {} |
| is_email | Valid email format | {} |
| matches_regex | Match pattern | {"pattern": "..."} |
| only_allow | Whitelist values | {"allowed_values": [...]} |
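The config column maps naturally onto functions that take a value and a config dict. The following sketch uses stand-in implementations for some of the names in the table; `VALIDATIONS` and `run_validations` are hypothetical, and treating `matches_regex` as a full-string match is an assumption, since the table does not say whether partial matches count.

```python
import re

# Stand-in checks: each takes (value, config) and returns True when the value passes.
VALIDATIONS = {
    "min_length": lambda value, cfg: len(value) >= cfg["threshold"],
    "max_length": lambda value, cfg: len(value) <= cfg["threshold"],
    "is_positive": lambda value, cfg: value > 0,
    "matches_regex": lambda value, cfg: re.fullmatch(cfg["pattern"], value) is not None,
    "only_allow": lambda value, cfg: value in cfg["allowed_values"],
}


def run_validations(record, rules):
    """Return (field, rule_name) pairs for every rule that fails."""
    failures = []
    for field, checks in rules.items():
        if field not in record:
            continue
        for name, config in checks.items():
            if not VALIDATIONS[name](record[field], config):
                failures.append((field, name))
    return failures


rules = {
    "age": {"is_positive": {}},
    "role": {"only_allow": {"allowed_values": ["admin", "user"]}},
    "code": {"matches_regex": {"pattern": r"[A-Z]{2}\d{2}"}},
}
failures = run_validations({"age": -1, "role": "admin", "code": "AB12"}, rules)
```

Only the `age` field fails here, because -1 is not greater than zero.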
ETL Pipeline Architecture¶
PyCharter's ETL system follows the Extract-Transform-Load pattern:
graph TB
    subgraph Extract
        E1[HTTPExtractor]
        E2[FileExtractor]
        E3[DatabaseExtractor]
        E4[CloudStorageExtractor]
    end
    subgraph Transform
        T1[Rename]
        T2[Filter]
        T3[AddField]
        T4[Convert]
        T5[CustomFunction]
    end
    subgraph Load
        L1[PostgresLoader]
        L2[FileLoader]
        L3[CloudStorageLoader]
    end
    E1 --> T1
    E2 --> T1
    E3 --> T1
    E4 --> T1
    T1 --> T2 --> T3 --> T4 --> T5
    T5 --> L1
    T5 --> L2
    T5 --> L3
Pipeline Composition¶
Pipelines are built by chaining an extractor, any number of transformers, and a loader with the | (pipe) operator.
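In Python, this style of composition is usually implemented by overloading `__or__`. The sketch below shows the mechanism with hypothetical `Stage` and `Chain` classes; these are not PyCharter's classes, and the real extractors, transformers, and loaders take their own arguments.

```python
import asyncio


class Stage:
    """Hypothetical stand-in for one pipeline step wrapping a plain function."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # stage | stage starts a chain
        return Chain([self, other])


class Chain:
    """Ordered list of stages; chain | stage appends one more."""

    def __init__(self, stages):
        self.stages = list(stages)

    def __or__(self, other):
        return Chain(self.stages + [other])

    async def run(self, rows=None):
        # Feed the output of each stage into the next one.
        for stage in self.stages:
            rows = stage.func(rows)
        return rows


extract = Stage(lambda _: [{"n": 1}, {"n": -2}, {"n": 3}])
keep_positive = Stage(lambda rows: [r for r in rows if r["n"] > 0])
double = Stage(lambda rows: [{"n": r["n"] * 2} for r in rows])

pipeline = extract | keep_positive | double
result = asyncio.run(pipeline.run())
```

Because `|` is left-associative, `extract | keep_positive` produces a `Chain` first, and each further `|` appends to it, so stages run in the order they are written.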
Async Execution¶
All pipeline operations are async for better performance:
import asyncio
# From a script
result = asyncio.run(pipeline.run())
# From async code
result = await pipeline.run()
Metadata Store¶
The metadata store is a centralized registry for:
- Schemas and their versions
- Coercion rules
- Validation rules
- Metadata (ownership, description, governance)
- Ontology (when using a wiki-enabled store)
- Quality metrics history
Store Backends¶
| Backend | Use Case |
|---|---|
| InMemoryMetadataStore | Testing, development |
| SQLiteMetadataStore | Single-user, local development |
| PostgresMetadataStore | Production, multi-user |
| MongoDBMetadataStore | Document-oriented workloads |
| RedisMetadataStore | High-performance caching |
Schema Versioning¶
Schemas are versioned to track changes:
# Store a new version
store.store_schema("user", schema_v1, version="1.0.0")
store.store_schema("user", schema_v2, version="2.0.0")
# Get specific version
schema = store.get_schema("user", version="1.0.0")
# Get latest version
schema = store.get_schema("user") # Returns 2.0.0
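To show how a "latest version" lookup can work, here is a small in-memory sketch. `TinySchemaStore` is a hypothetical stand-in, not one of the store backends, and it assumes that "latest" means the highest semantic version rather than the most recently stored one.

```python
class TinySchemaStore:
    """Hypothetical in-memory store keyed by schema name and version string."""

    def __init__(self):
        self._schemas = {}  # name -> {version -> schema}

    def store_schema(self, name, schema, version):
        self._schemas.setdefault(name, {})[version] = schema

    def get_schema(self, name, version=None):
        versions = self._schemas[name]
        if version is None:
            # Compare versions numerically so that "10.0.0" sorts after "2.0.0".
            version = max(versions, key=lambda v: tuple(int(p) for p in v.split(".")))
        return versions[version]


store = TinySchemaStore()
store.store_schema("user", {"v": 1}, version="1.0.0")
store.store_schema("user", {"v": 10}, version="10.0.0")
store.store_schema("user", {"v": 2}, version="2.0.0")
latest = store.get_schema("user")
```

Comparing the parts as integers avoids the string-ordering trap where "10.0.0" would sort before "2.0.0".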
Quality Assurance¶
PyCharter supports two types of quality checks: contract (row-based) and pipeline (column/dataset-based). See Data quality: contract vs pipeline for an overview and when to use each.
Contract quality metrics (row-based)¶
| Metric | Description |
|---|---|
| overall_score | Overall quality (0-100) |
| violation_rate | % of records with errors |
| completeness | % of non-null required fields |
| accuracy | % of valid values |
Quality Thresholds¶
Set alerts when quality drops:
thresholds = QualityThresholds(
    min_overall_score=95.0,   # Alert if score < 95
    max_violation_rate=0.05,  # Alert if violations > 5%
)
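The comparison behind these alerts can be sketched in a few lines. The `Thresholds` dataclass and both helpers below are hypothetical stand-ins, not PyCharter's `QualityThresholds` or its scoring logic. The sketch treats the violation rate as a fraction between 0 and 1, following the `0.05` means 5% convention in the example above, and takes the overall score as an input because its formula is not described here.

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    """Hypothetical stand-in mirroring the two fields used above."""

    min_overall_score: float = 95.0
    max_violation_rate: float = 0.05


def violation_rate(records, is_valid):
    """Fraction of records that fail the supplied validity check."""
    return sum(1 for record in records if not is_valid(record)) / len(records)


def breaches(overall_score, rate, thresholds):
    """Return a human-readable alert for every threshold that is crossed."""
    alerts = []
    if overall_score < thresholds.min_overall_score:
        alerts.append(f"overall_score {overall_score:.1f} < {thresholds.min_overall_score}")
    if rate > thresholds.max_violation_rate:
        alerts.append(f"violation_rate {rate:.2f} > {thresholds.max_violation_rate}")
    return alerts


# 20 records, 2 of which have a non-positive age.
records = [{"age": 30}] * 18 + [{"age": -1}] * 2
rate = violation_rate(records, lambda r: r["age"] > 0)
alerts = breaches(overall_score=97.0, rate=rate, thresholds=Thresholds())
```

With 2 bad records out of 20, the rate is 0.10, which crosses the 5% limit even though the supplied score stays above 95.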
Violation Tracking¶
Every validation error is tracked:
# Query violations
violations = store.query_violations(
    schema_id="user_schema",
    status="open",
    severity="high",
)
API Tiers¶
PyCharter's API is organized into three tiers:
Tier 1: Primary Classes (Recommended)¶
Best performance, full features:
from pycharter import Validator, Pipeline, QualityCheck
validator = Validator.from_file("contract.yaml")
Tier 2: Convenience Functions¶
Quick start, one-off operations:
from pycharter import from_dict, validate, validate_with_contract
Model = from_dict(schema, "User")
result = validate(Model, data)
Tier 3: Low-Level Utilities¶
For when you need fine-grained control over individual steps. The API Reference lists the available utilities.
Error Handling¶
PyCharter uses a structured exception hierarchy:
from pycharter import Pipeline
from pycharter.shared.errors import (
    PyCharterError,         # Base exception
    ConfigError,            # Config loading/parsing
    ConfigValidationError,  # Schema validation
    ExpressionError,        # Expression evaluation
)

try:
    pipeline = Pipeline.from_config_dir("invalid/")
except ConfigError as e:
    # Handle the specific error first
    print(f"Config error: {e}")
except PyCharterError as e:
    # Fall back to the base class for everything else
    print(f"PyCharter error: {e}")
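The order of the `except` clauses matters whenever one exception derives from another. The sketch below demonstrates this with two hypothetical stand-in classes, `AppError` and `AppConfigError`; it assumes the same base-and-subclass relationship that the "Base exception" comment suggests, and does not import PyCharter.

```python
class AppError(Exception):
    """Stand-in for a library-wide base exception."""


class AppConfigError(AppError):
    """Stand-in for a more specific configuration error."""


def load(path):
    # Always fails, to exercise the handlers below.
    raise AppConfigError(f"no pipeline config found in {path!r}")


def describe(path):
    try:
        load(path)
    except AppConfigError as exc:
        # The specific handler must come first; otherwise the base-class
        # handler below would catch the error and this branch would never run.
        return f"config: {exc}"
    except AppError as exc:
        return f"other: {exc}"


message = describe("invalid/")
```

Catching the base class last still gives a single place to handle any error the library raises.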
Error Modes¶
Control how pipeline errors are handled:
from pycharter.shared.errors import ErrorMode, ErrorContext

# These calls use await, so they must run inside a coroutine
# (or be wrapped in asyncio.run() when called from a script).

# Strict: raise on first error (default)
result = await pipeline.run(error_context=ErrorContext(mode=ErrorMode.STRICT))

# Lenient: log and continue
result = await pipeline.run(error_context=ErrorContext(mode=ErrorMode.LENIENT))

# Collect: gather all errors
result = await pipeline.run(error_context=ErrorContext(mode=ErrorMode.COLLECT))
print(result.errors)  # List of all errors
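The difference between the three modes is easiest to see in a per-row loop. The sketch below is a plain Python illustration of the behaviour described in the comments above; the `Mode` enum and `process` function are hypothetical and do not use `ErrorMode` or `ErrorContext`.

```python
import logging
from enum import Enum


class Mode(Enum):
    STRICT = "strict"
    LENIENT = "lenient"
    COLLECT = "collect"


def process(rows, transform, mode):
    """Apply transform to each row, handling failures according to mode."""
    results, errors = [], []
    for row in rows:
        try:
            results.append(transform(row))
        except Exception as exc:
            if mode is Mode.STRICT:
                raise  # stop at the first failure
            if mode is Mode.LENIENT:
                logging.warning("skipping row %r: %s", row, exc)
            else:
                errors.append(exc)  # COLLECT: keep going and remember the error
    return results, errors


rows = ["1", "x", "3"]
lenient_results, lenient_errors = process(rows, int, Mode.LENIENT)
collect_results, collect_errors = process(rows, int, Mode.COLLECT)
```

Lenient and collect both yield the two good rows; only collect hands the failure back to the caller, while strict would raise on `"x"` and return nothing.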
Next Steps¶
- ETL Pipelines Tutorial - Build data pipelines
- Contracts Tutorial - Master validation
- Data quality: contract vs pipeline - Two types of quality checks
- Quality Tutorial - Monitor contract (row-based) quality
- Validation Worker - Run validation asynchronously
- API Reference - Complete documentation