The Complete Data Journey: From Contract to Runtime Validation
This document describes the complete data production journey using PyCharter, from initial contract specification (collaboration between business units and developers) to runtime validation in production systems.
Overview: The 6-Stage Journey
Business Unit (Data Definitions & Governance)
↓
Developer (Data Types & Technical Constraints)
↓
Contract File (YAML/JSON) - Single Source of Truth
↓
[PyCharter Services: Parse → Store → Build → Generate → Validate]
↓
Runtime Validation (Production Systems)
↓
(Optional) REST API for Service Integration
PyCharter Core Services:
1. Contract Parser: Decomposes contract files into components
2. Metadata Store: Stores components (PostgreSQL, MongoDB, Redis, or InMemory)
3. Contract Builder: Reconstructs complete contracts from stored components
4. Pydantic Generator: Generates Pydantic models from JSON Schemas
5. JSON Schema Converter: Converts Pydantic models to JSON Schemas
6. Runtime Validator: Validates data against contracts
7. Quality Assurance: Data quality checks, metrics, violation tracking, and profiling
8. REST API (optional): HTTP endpoints for all services
Stage 1: Contract Creation (Business + Developer Collaboration)
Participants:
- Business Unit: Defines data definitions, governance rules, ownership, and business requirements
- Developer: Adds technical constraints, data types, formats, and validation rules
What Happens:
- Business stakeholders define what data means, who owns it, and governance policies
- Developers add technical specifications: data types, formats (UUID, email, date-time), validation constraints (minLength, maxLength, pattern), and technical requirements
- Together they create a data contract file (YAML or JSON) that serves as the single source of truth
Contract File Structure (data/examples/book/book_contract.yaml):
schema:  # Developer: Technical schema definition
  type: object
  properties:
    user_id:
      type: string
      format: uuid  # Developer: Technical constraint
      description: Unique identifier for the user
    username:
      type: string
      minLength: 3  # Developer: Technical validation
      maxLength: 20
      pattern: "^[a-z0-9_]+$"
      description: Username (lowercase alphanumeric and underscores only)
    email:
      type: string
      format: email  # Developer: Technical format
      description: User's email address
    age:
      type: integer
      minimum: 0  # Developer: Business rule enforcement
      maximum: 150
      description: User's age in years
    created_at:
      type: string
      format: date-time
      description: Account creation timestamp
  required:
    - user_id
    - username
    - email
    - created_at
governance_rules:  # Business: Data governance policies
  data_retention:
    days: 365  # Business: Retention policy
    description: User data should be retained for 365 days
  pii_fields:  # Business: Privacy requirements
    fields:
      - email
      - user_id
    description: Fields containing personally identifiable information
  access_control:
    level: restricted
    description: User data requires restricted access
ownership:  # Business: Ownership information
  owner: data-team
  team: engineering
  contact: data-team@example.com
  description: Data team owns user data contracts
metadata:  # Both: Versioning and documentation
  version: "1.0.0"
  description: User data contract for authentication and profile management
  created: "2024-01-01"
  last_updated: "2024-01-15"
# Optional: ontology (semantic field annotations; see Wiki / Concepts)
# ontology:
#   version: "1.0.0"
#   fields:
#     email: { concept: user_email, definition: "Primary email address" }
Key Points:
- Contract file is the single source of truth for data structure and rules
- Separates business concerns (governance, ownership) from technical concerns (types, constraints)
- Versioned for tracking changes over time
- Human-readable format (YAML or JSON) for collaboration
Output: A contract file that combines business requirements with technical specifications
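Before a contract enters the pipeline, it can help to confirm that the top-level sections shown above are all present. The following is a minimal sketch using only the standard library. `EXPECTED_SECTIONS` and `missing_sections` are hypothetical names for illustration, not part of PyCharter, and the contract is written as a plain dictionary rather than loaded from YAML.

```python
# Hypothetical pre-flight check; not part of PyCharter's API.
EXPECTED_SECTIONS = ("schema", "governance_rules", "ownership", "metadata")


def missing_sections(contract: dict) -> list[str]:
    """Return the expected top-level sections that the contract lacks."""
    return [name for name in EXPECTED_SECTIONS if name not in contract]


contract = {
    "schema": {"type": "object", "properties": {"email": {"type": "string"}}},
    "ownership": {"owner": "data-team"},
    "metadata": {"version": "1.0.0"},
}

print(missing_sections(contract))  # ['governance_rules']
```

A check like this could run in review tooling so that a contract missing its business sections is caught before parsing.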
Stage 2: Contract Parsing (Developer)
Service: pycharter.contract_parser
Who: Developer (automated process, typically in CI/CD or setup scripts)
What Happens:
- Contract file is parsed and decomposed into structured components
- Separates schema, governance_rules, ownership, metadata, and optional ontology into distinct objects
- Returns a ContractMetadata object that makes each component accessible independently
Code Example:
from pycharter import parse_contract, parse_contract_file, ContractMetadata

# Parse the contract file (YAML or JSON)
metadata: ContractMetadata = parse_contract_file("data/examples/book/book_contract.yaml")

# Access decomposed components
schema = metadata.schema                 # JSON Schema for model generation
governance = metadata.governance_rules   # For governance enforcement
ownership = metadata.ownership           # For access control
metadata_info = metadata.metadata        # Version, description, etc.
ontology = metadata.ontology             # Optional semantic field annotations

# Or parse from a dictionary
contract_dict = {
    "schema": {...},
    "governance_rules": {...},
    "ownership": {...},
    "metadata": {...}
}
metadata = parse_contract(contract_dict)
Why This Matters:
- Separates concerns: schema can be used independently for model generation
- Governance rules can be enforced separately
- Ownership information can be used for access control
- Metadata can be used for versioning and documentation
Output: Structured ContractMetadata object with separated components
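To make the decomposition concrete, here is a minimal standard-library sketch of splitting a contract dictionary into separately accessible parts. `ContractParts` and `split_contract` are illustrative stand-ins, not PyCharter's `ContractMetadata` or its parser, and the real object may carry more fields.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class ContractParts:
    """Illustrative stand-in for a parsed contract; not PyCharter's ContractMetadata."""
    schema: dict[str, Any]
    governance_rules: dict[str, Any] = field(default_factory=dict)
    ownership: dict[str, Any] = field(default_factory=dict)
    metadata: dict[str, Any] = field(default_factory=dict)
    ontology: Optional[dict[str, Any]] = None


def split_contract(contract: dict[str, Any]) -> ContractParts:
    """Separate a contract dictionary into independently usable components."""
    if "schema" not in contract:
        raise ValueError("contract must contain a 'schema' section")
    return ContractParts(
        schema=contract["schema"],
        governance_rules=contract.get("governance_rules", {}),
        ownership=contract.get("ownership", {}),
        metadata=contract.get("metadata", {}),
        ontology=contract.get("ontology"),
    )


parts = split_contract({
    "schema": {"type": "object"},
    "ownership": {"owner": "data-team"},
})
print(parts.ownership["owner"])  # data-team
print(parts.ontology)            # None
```

Each part can then be handed to a different consumer: the schema to model generation, ownership to access control, and so on.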
Stage 3: Metadata Storage (Developer)
Service: pycharter.metadata_store
Who: Developer (one-time setup, then automated)
What Happens:
- Decomposed metadata components are stored in a metadata store (database or in-memory)
- Multiple store implementations available: PostgreSQL, MongoDB, Redis, or InMemory
- Schemas are versioned for evolution tracking
- Governance rules and ownership are stored as part of metadata (not separate entities)
- Enables querying, versioning, and retrieval of stored metadata
Available Metadata Store Implementations:
- PostgresMetadataStore: PostgreSQL database (requires psycopg2-binary)
  - Requires schema initialization: pycharter db init <connection_string>
  - Supports migrations: pycharter db upgrade
  - Optional seeding: pycharter db seed <seed_dir> <connection_string>
- MongoDBMetadataStore: MongoDB database (requires pymongo)
  - Auto-creates indexes on first connection
  - No separate initialization needed
- InMemoryMetadataStore: In-memory storage (no dependencies)
  - Perfect for testing and development
  - Data is lost when process ends
- RedisMetadataStore: Redis database (requires redis)
  - Fast key-value storage
  - Good for caching and high-throughput scenarios
Code Example:
from pycharter import PostgresMetadataStore, MongoDBMetadataStore, InMemoryMetadataStore

# Option 1: PostgreSQL (requires: pycharter db init first)
store = PostgresMetadataStore(connection_string="postgresql://user:pass@localhost/db")
store.connect()

# Option 2: MongoDB (auto-initializes)
store = MongoDBMetadataStore(
    connection_string="mongodb://user:pass@localhost:27017/",
    database_name="pycharter"
)
store.connect()

# Option 3: InMemory (for testing)
store = InMemoryMetadataStore()
store.connect()

# Store schema with versioning
schema_id = store.store_schema(
    schema_name="user",
    schema=metadata.schema,
    version=metadata.versions.get("schema", "1.0.0")
)

# Merge ownership and governance rules into metadata before storing
# Ownership and governance are part of metadata, not separate entities
metadata_dict = metadata.metadata.copy() if metadata.metadata else {}
if metadata.ownership:
    metadata_dict["business_owners"] = [metadata.ownership.get("owner", "unknown")] if metadata.ownership.get("owner") else []
if metadata.governance_rules:
    metadata_dict["governance_rules"] = metadata.governance_rules

# Store metadata once with all information (ownership and governance included)
if metadata_dict:
    store.store_metadata(
        resource_id=schema_id,
        resource_type="schema",
        metadata=metadata_dict
    )

# Store coercion rules (if extracted from schema)
if metadata.coercion_rules:
    store.store_coercion_rules(
        schema_id=schema_id,
        coercion_rules=metadata.coercion_rules,
        version=metadata.versions.get("coercion_rules", "1.0.0")
    )

# Store validation rules (if extracted from schema)
if metadata.validation_rules:
    store.store_validation_rules(
        schema_id=schema_id,
        validation_rules=metadata.validation_rules,
        version=metadata.versions.get("validation_rules", "1.0.0")
    )

store.disconnect()
Database Initialization (PostgreSQL only):
# Initialize database schema
pycharter db init postgresql://user:pass@localhost/pycharter
# Apply migrations
pycharter db upgrade postgresql://user:pass@localhost/pycharter
# (Optional) Seed initial data (owners, domains, systems)
pycharter db seed data/seed postgresql://user:pass@localhost/pycharter
Why This Matters:
- Versioning: Track schema evolution over time
- Queryability: Find schemas by name, version, owner, etc.
- Audit Trail: Know who owns what and when it changed
- Centralized Storage: Single source of truth in database
- Multi-Application: Multiple applications can retrieve same schemas
- Flexibility: Choose the right store for your use case (PostgreSQL for ACID, MongoDB for flexibility, InMemory for testing)
Output: All metadata stored in database, versioned and queryable
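The versioning idea can be illustrated without a database. The sketch below keeps schemas in a dictionary keyed by name and version and resolves "latest" by comparing version numbers numerically. `TinySchemaStore` is a hypothetical class written for this example; it does not mirror the signatures of PyCharter's store classes, and it assumes plain `MAJOR.MINOR.PATCH` version strings.

```python
from typing import Any, Optional


def _version_key(version: str) -> tuple[int, ...]:
    """Turn '1.10.0' into (1, 10, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


class TinySchemaStore:
    """Hypothetical in-memory store keyed by (name, version)."""

    def __init__(self) -> None:
        self._schemas: dict[tuple[str, str], dict[str, Any]] = {}

    def store_schema(self, name: str, schema: dict[str, Any], version: str) -> None:
        if (name, version) in self._schemas:
            raise ValueError(f"{name} {version} is already stored")
        self._schemas[(name, version)] = schema

    def get_schema(self, name: str, version: Optional[str] = None) -> dict[str, Any]:
        """Return the requested version, or the highest version when none is given."""
        if version is None:
            versions = [v for (n, v) in self._schemas if n == name]
            if not versions:
                raise KeyError(name)
            version = max(versions, key=_version_key)
        return self._schemas[(name, version)]


tiny_store = TinySchemaStore()
tiny_store.store_schema("user", {"required": ["user_id"]}, version="1.2.0")
tiny_store.store_schema("user", {"required": ["user_id", "email"]}, version="1.10.0")

print(tiny_store.get_schema("user")["required"])           # ['user_id', 'email']
print(tiny_store.get_schema("user", "1.2.0")["required"])  # ['user_id']
```

Comparing versions as integer tuples rather than strings matters here: as strings, "1.2.0" would sort after "1.10.0".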
Stage 4: Model Generation (Developer - On-Demand)
Service: pycharter.pydantic_generator and pycharter.contract_builder
Who: Developer (in application code, ETL scripts, APIs)
What Happens:
- Retrieve schema from metadata store (or use directly from parsed contract)
- Optionally rebuild complete contract from store using Contract Builder
- Dynamically generate Pydantic model class at runtime
- Model includes all validations, constraints, and types from the schema
- Model is fully functional and can be used like any Pydantic model
Code Example:
import json

from pydantic import ValidationError
from pycharter import (
    from_dict,
    build_contract_from_store,
    get_model_from_contract,
    get_model_from_store,
)

# Option 1: Get schema from metadata store and generate model
store.connect()
schema = store.get_schema(schema_id)  # Retrieve stored schema
UserModel = from_dict(schema, "User")
store.disconnect()

# Option 2: Rebuild complete contract from store (includes metadata, rules, etc.)
store.connect()
reconstructed_contract = build_contract_from_store(
    store=store,
    schema_id=schema_id,
    version=None,  # Use latest version
    include_metadata=True,
    include_ownership=True,
    include_governance=True
)
# Contract has: schema (RAW), coercion_rules (separate), validation_rules (separate)
# Generate model - get_model_from_contract handles merging internally
UserModel = get_model_from_contract(
    contract=reconstructed_contract,
    model_name="User"
)
store.disconnect()

# Option 3: Use convenience function to get model directly from store
store.connect()
UserModel = get_model_from_store(
    store=store,
    schema_id=schema_id,
    model_name="User",
    version=None
)
store.disconnect()

# Option 4: Use schema directly from parsed contract
schema = metadata.schema
UserModel = from_dict(schema, "User")

# Option 5: Load from file
with open("schema.json") as f:
    schema = json.load(f)
UserModel = from_dict(schema, "User")

# Now you have a fully-functional Pydantic model!
# It includes all validations: UUID format, email format, minLength, etc.
# Use it like any Pydantic model
user = UserModel(
    user_id="123e4567-e89b-12d3-a456-426614174000",
    username="alice",
    email="alice@example.com",
    age=30,
    created_at="2024-01-15T10:30:00Z"
)

# Validation happens automatically
try:
    invalid_user = UserModel(username="ab", email="not-email", age=-5)
except ValidationError as e:
    print("Validation failed:", e)
Why This Matters:
- Dynamic Generation: No need to write model classes manually
- Always Up-to-Date: Models generated from latest schema version
- Type Safety: Full Pydantic type checking and validation
- Runtime Flexibility: Generate models on-demand when needed
- Consistency: All applications use same schema definitions
- Complete Contracts: Contract Builder consolidates all components; schema is raw and rules are separate; Validator merges rules internally during validation
Output: A Pydantic model class ready for validation
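To show what "generating a class from a schema at runtime" means mechanically, here is a standard-library sketch that builds a dataclass from a JSON Schema's properties. It is an analogy only: it uses `dataclasses.make_dataclass` instead of Pydantic, it is not how PyCharter's generator is implemented, and unlike a Pydantic model it does not enforce types or constraints. `JSON_TO_PY` and `dataclass_from_schema` are hypothetical names.

```python
from dataclasses import field, make_dataclass
from typing import Any, Optional

# Assumed mapping from JSON Schema type names to Python types (illustrative subset).
JSON_TO_PY = {"string": str, "integer": int, "number": float, "boolean": bool}


def dataclass_from_schema(schema: dict[str, Any], name: str) -> type:
    """Build a dataclass whose fields mirror the schema's properties."""
    required = set(schema.get("required", []))
    required_fields, optional_fields = [], []
    for prop, spec in schema.get("properties", {}).items():
        py_type = JSON_TO_PY.get(spec.get("type"), object)
        if prop in required:
            required_fields.append((prop, py_type))
        else:
            optional_fields.append((prop, Optional[py_type], field(default=None)))
    # Fields without defaults must come first in a dataclass.
    return make_dataclass(name, required_fields + optional_fields)


schema = {
    "type": "object",
    "properties": {
        "username": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["username"],
}

User = dataclass_from_schema(schema, "User")
user = User(username="alice")
print(user)  # User(username='alice', age=None)
```

The same shape (walk the properties, map each to a typed field, mark required ones) is what a schema-to-model generator has to do, with constraint handling layered on top.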
Stage 5: Runtime Validation (Developer - In Production)
Service: pycharter.runtime_validator
Who: Developer (in production code: ETL pipelines, APIs, data processing scripts)
What Happens:
- Validate incoming data against the generated Pydantic model
- Catch contract violations early before data enters your system
- Return structured validation results with error details
- Support both single record and batch validation
Code Example:
from pydantic import ValidationError
from pycharter import validate, validate_batch, ValidationResult

# Single record validation (e.g., API endpoint)
def process_api_request(incoming_data: dict):
    result: ValidationResult = validate(UserModel, incoming_data)
    if result.is_valid:
        # Data passes all validations from contract
        validated_user = result.data
        # Continue processing...
        save_to_database(validated_user)
        return {"status": "success", "user_id": validated_user.user_id}
    else:
        # Contract violations detected
        return {
            "status": "error",
            "errors": result.errors,
            "message": "Data does not conform to contract"
        }

# Batch validation (e.g., ETL pipeline processing CSV or database records)
def process_etl_batch(batch_of_records: list[dict]):
    results = validate_batch(UserModel, batch_of_records)
    valid_records = [r.data for r in results if r.is_valid]
    invalid_records = [
        {"data": batch_of_records[i], "errors": r.errors}
        for i, r in enumerate(results) if not r.is_valid
    ]
    # Process valid records
    save_to_database(valid_records)
    # Handle invalid records (log, send to DLQ, etc.)
    if invalid_records:
        log_validation_errors(invalid_records)
        send_to_dead_letter_queue(invalid_records)
    return {
        "processed": len(batch_of_records),
        "valid": len(valid_records),
        "invalid": len(invalid_records)
    }

# Strict mode (raises exceptions instead of returning results)
def strict_validation(data: dict):
    try:
        result = validate(UserModel, data, strict=True)
        return result.data
    except ValidationError as e:
        # Re-raise with the original error chained for debugging
        raise ValueError(f"Contract violation: {e}") from e
Why This Matters:
- Early Detection: Catch data quality issues before they propagate
- Contract Enforcement: Ensure all data conforms to business and technical rules
- Error Reporting: Detailed error messages help identify issues
- Production Safety: Prevent bad data from entering your systems
- Batch Processing: Efficiently validate large datasets
Output: Validated data ready for processing, or error information for invalid data
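To illustrate what a contract violation looks like at the field level, the sketch below checks a record against a handful of the JSON Schema keywords used in the contract (`required`, `minLength`, `maxLength`, `pattern`, `minimum`, `maximum`). It is a hand-rolled stand-in using only the standard library; `check_record` is a hypothetical helper and covers far less than the Pydantic-based validation described above.

```python
import re
from typing import Any


def check_record(schema: dict[str, Any], record: dict[str, Any]) -> list[str]:
    """Return human-readable violations; an empty list means the record passes."""
    errors = []
    for name in schema.get("required", []):
        if name not in record:
            errors.append(f"{name}: required field is missing")
    for name, spec in schema.get("properties", {}).items():
        if name not in record:
            continue
        value = record[name]
        if isinstance(value, str):
            if "minLength" in spec and len(value) < spec["minLength"]:
                errors.append(f"{name}: shorter than {spec['minLength']} characters")
            if "maxLength" in spec and len(value) > spec["maxLength"]:
                errors.append(f"{name}: longer than {spec['maxLength']} characters")
            if "pattern" in spec and not re.search(spec["pattern"], value):
                errors.append(f"{name}: does not match {spec['pattern']}")
        if isinstance(value, int) and not isinstance(value, bool):
            if "minimum" in spec and value < spec["minimum"]:
                errors.append(f"{name}: below minimum {spec['minimum']}")
            if "maximum" in spec and value > spec["maximum"]:
                errors.append(f"{name}: above maximum {spec['maximum']}")
    return errors


schema = {
    "properties": {
        "username": {"type": "string", "minLength": 3, "maxLength": 20, "pattern": "^[a-z0-9_]+$"},
        "age": {"type": "integer", "minimum": 0, "maximum": 150},
    },
    "required": ["username"],
}

print(check_record(schema, {"username": "alice", "age": 30}))  # []
for problem in check_record(schema, {"username": "Al", "age": -5}):
    print(problem)
```

Returning a list of messages rather than raising on the first failure mirrors the non-strict mode above, where all violations for a record are reported together.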
Stage 6: Quality Assurance (Developer - In Production)
Service: pycharter.quality
Who: Developer (in production code: ETL pipelines, data quality monitoring, scheduled jobs)
What Happens:
- Run quality checks against data to measure data quality metrics
- Track violations and quality scores over time
- Profile data to understand its characteristics
- Persist quality metrics and violations to database for historical tracking
- Monitor data quality trends and alert on threshold breaches
Code Example:
from pycharter import QualityCheck, QualityCheckOptions, QualityThresholds, DataProfiler
from pycharter.db.models.base import get_session
from pycharter.config import get_database_url

# Initialize quality check with database session for persistence
# (`store` is a connected metadata store, as set up in Stage 3)
db_url = get_database_url()
db_session = get_session(db_url)
check = QualityCheck(store=store, db_session=db_session)

# Run quality check with profiling enabled
report = check.run(
    schema_id="user_schema_v1",
    data="data/users.json",
    options=QualityCheckOptions(
        calculate_metrics=True,
        record_violations=True,
        include_profiling=True,  # Enable data profiling
        check_thresholds=True,
        thresholds=QualityThresholds(
            min_overall_score=95.0,
            max_violation_rate=0.05,
            min_completeness=0.95,
            min_accuracy=0.95
        )
    )
)

# Access quality metrics
print(f"Overall Score: {report.quality_score.overall_score:.2f}/100")
print(f"Accuracy: {report.quality_score.accuracy:.2%}")
print(f"Completeness: {report.quality_score.completeness:.2%}")
print(f"Violation Rate: {report.quality_score.violation_rate:.2%}")

# Access profiling data
if "profiling" in report.metadata:
    profile = report.metadata["profiling"]
    print(f"Record Count: {profile['record_count']}")
    print(f"Average Completeness: {profile['overall_stats']['average_completeness']:.2%}")

# Check threshold breaches
if report.threshold_breaches:
    print("⚠ Threshold Breaches:")
    for breach in report.threshold_breaches:
        print(f" - {breach}")

# Query violations from database
violations = check.violation_tracker.get_violations(
    schema_id="user_schema_v1",
    status="open",
    severity="critical"
)

# Get violation summary
summary = check.violation_tracker.get_violation_summary(schema_id="user_schema_v1")
print(f"Total Violations: {summary['total']}")
print(f"Open: {summary['open']}, Resolved: {summary['resolved']}")

# Standalone data profiling (`data_list` is a list of record dicts loaded elsewhere)
profiler = DataProfiler()
profile = profiler.profile(data_list)
print(f"Field Profiles: {profile['field_profiles']}")
Quality Metrics:
- Overall Score: Composite quality score (0-100)
- Accuracy: Percentage of valid records
- Completeness: Percentage of required fields present
- Violation Rate: Percentage of records with violations
- Field-Level Metrics: Per-field completeness and violation rates
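The arithmetic behind three of these metrics can be sketched directly from the definitions above. This is an illustration under stated assumptions, not PyCharter's implementation: `quality_metrics` and `all_required_present` are hypothetical helpers, a field counts as present when its value is not `None`, and the composite overall score is left out because its weighting is not described here.

```python
def quality_metrics(records: list[dict], required: list[str], is_valid) -> dict[str, float]:
    """Compute accuracy, completeness, and violation rate as fractions in [0, 1]."""
    total = len(records)
    if total == 0:
        return {"accuracy": 0.0, "completeness": 0.0, "violation_rate": 0.0}
    valid = sum(1 for record in records if is_valid(record))
    present = sum(
        1 for record in records for name in required if record.get(name) is not None
    )
    return {
        "accuracy": valid / total,
        "completeness": present / (total * len(required)) if required else 1.0,
        "violation_rate": (total - valid) / total,
    }


records = [
    {"user_id": "u1", "email": "a@example.com"},
    {"user_id": "u2", "email": None},
    {"user_id": "u3", "email": "c@example.com"},
    {"user_id": None, "email": "d@example.com"},
]
required = ["user_id", "email"]


def all_required_present(record: dict) -> bool:
    """Toy validity rule for this example: every required field has a value."""
    return all(record.get(name) is not None for name in required)


metrics = quality_metrics(records, required, all_required_present)
print(metrics)  # {'accuracy': 0.5, 'completeness': 0.75, 'violation_rate': 0.5}
```

Here two of four records are valid (accuracy 0.5), and six of the eight required values are present (completeness 0.75).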
Violation Tracking:
- Individual violations are recorded with details (field, error type, error message)
- Violations can be filtered by schema, status, severity, date range
- Violations can be resolved and tracked over time
- All violations are persisted to database when database session is provided
Data Profiling:
- Statistical analysis of data characteristics
- Per-field profiling: null counts, data types, unique values, distributions
- Type-specific statistics: numeric (min/max/mean/median/std_dev), string (length stats), boolean (distribution)
- Most common values and patterns
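A per-field profile of the kind listed above can be approximated with the standard library. The sketch below is a simplified stand-in for a profiler, assuming hashable values; `profile_field` is a hypothetical helper and its output keys are not those of PyCharter's `DataProfiler`.

```python
import statistics
from collections import Counter
from typing import Any


def profile_field(values: list[Any]) -> dict[str, Any]:
    """Summarize one column: nulls, distinct values, most common value, numeric stats."""
    non_null = [v for v in values if v is not None]
    profile: dict[str, Any] = {
        "count": len(values),
        "null_count": len(values) - len(non_null),
        "unique_count": len(set(non_null)),
        "most_common": Counter(non_null).most_common(1)[0][0] if non_null else None,
    }
    numeric = [v for v in non_null if isinstance(v, (int, float)) and not isinstance(v, bool)]
    # Only add numeric statistics when every non-null value is a number.
    if numeric and len(numeric) == len(non_null):
        profile.update(
            min=min(numeric),
            max=max(numeric),
            mean=statistics.mean(numeric),
            median=statistics.median(numeric),
        )
    return profile


ages = [30, 25, None, 30, 41]
age_profile = profile_field(ages)
print(age_profile)
```

Running the same function over every field of a dataset gives a quick picture of null rates and value ranges before thresholds are chosen.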
Database Persistence:
- Quality metrics are automatically saved to quality_metrics table
- Violations are automatically saved to quality_violations table
- Historical tracking enables trend analysis and monitoring
- All metrics include timestamps for time-series analysis
Why This Matters:
- Proactive Monitoring: Detect data quality issues before they impact downstream systems
- Historical Tracking: Track quality trends over time
- Violation Management: Track and resolve data quality issues systematically
- Data Understanding: Profiling helps understand data characteristics and patterns
- Threshold Alerting: Automatic alerts when quality falls below thresholds
- Production Safety: Ensure data quality meets business requirements
Output: Quality report with metrics, violations, and profiling data (optionally persisted to database)
Complete End-to-End Flow
Developer Workflow Example
Here's a complete example showing all stages working together:
from pycharter import (
    parse_contract_file,
    PostgresMetadataStore,
    from_dict,
    validate,
    ValidationResult,
)

# ============================================================
# SETUP PHASE (One-time or when contract changes)
# ============================================================

# Step 1: Parse contract (from business + developer collaboration)
metadata = parse_contract_file("data/examples/book/book_contract.yaml")

# Step 2: Store in database
# First, initialize database (PostgreSQL only)
# Run: pycharter db init postgresql://user:pass@localhost/pycharter
store = PostgresMetadataStore(connection_string="postgresql://...")
store.connect()
schema_id = store.store_schema(
    schema_name="user",
    schema=metadata.schema,
    version=metadata.versions.get("schema", "1.0.0")
)

# Merge ownership and governance rules into metadata before storing
# Ownership and governance are part of metadata, not separate entities
metadata_dict = metadata.metadata.copy() if metadata.metadata else {}
if metadata.ownership:
    metadata_dict["business_owners"] = [metadata.ownership.get("owner", "unknown")] if metadata.ownership.get("owner") else []
if metadata.governance_rules:
    metadata_dict["governance_rules"] = metadata.governance_rules

# Store metadata once with all information (ownership and governance included)
if metadata_dict:
    store.store_metadata(
        resource_id=schema_id,
        resource_type="schema",
        metadata=metadata_dict
    )
store.disconnect()

# ============================================================
# RUNTIME PHASE (In production code)
# ============================================================

# Step 3: Retrieve schema and generate model (on-demand)
store.connect()
schema = store.get_schema(schema_id)  # Get latest version
UserModel = from_dict(schema, "User")  # Generate model
store.disconnect()

# Step 4: Validate incoming data (in your ETL/API/processing code)
def process_incoming_user_data(raw_data: dict):
    result: ValidationResult = validate(UserModel, raw_data)
    if result.is_valid:
        # Data passes all validations from contract
        validated_user = result.data
        # Continue processing...
        return validated_user
    else:
        # Contract violations detected
        raise ValueError(f"Data contract violation: {result.errors}")
Key Benefits of This Journey
1. Single Source of Truth
- Contract file defines everything: schema, governance, ownership, metadata
- No duplication or drift between definitions
- Changes propagate automatically through the system
2. Separation of Concerns
- Business: Defines governance rules, ownership, retention policies
- Developer: Defines technical constraints, types, formats, validations
- Both collaborate in the same contract file but maintain their domains
3. Versioning & Evolution
- Track schema evolution over time in database
- Support multiple versions simultaneously
- Enable gradual migration and rollback
4. Runtime Flexibility
- Generate models on-demand from stored schemas
- No need to redeploy code when schemas change
- Applications automatically use latest schema versions
5. Type Safety & Validation
- Pydantic models provide full type checking
- All contract validations enforced automatically
- Catch errors early in the pipeline
6. Early Error Detection
- Validate data before it enters your systems
- Prevent bad data from propagating
- Detailed error messages help identify issues
7. Multi-Application Support
- One contract → Stored once → Multiple apps retrieve
- All applications validate against same contract
- Ensures consistency across the organization
Real-World Scenarios
Scenario 1: New Contract Version
Flow:
Business updates contract → Developer parses new version →
Store new version in database → Applications automatically retrieve new version →
Validation uses updated rules
Code:
# Parse new contract version
metadata_v2 = parse_contract_file("data/examples/book/book_contract.yaml")

# Store as new version
schema_id_v2 = store.store_schema(
    schema_name="user",
    schema=metadata_v2.schema,
    version="2.0.0"  # New version
)

# Applications can query by version
schema_v1 = store.get_schema_by_name_and_version("user", "1.0.0")
schema_v2 = store.get_schema_by_name_and_version("user", "2.0.0")
Scenario 2: Multiple Applications
Flow:
One contract → Stored once in database →
Multiple apps retrieve schema → Each generates its own model →
All validate against same contract
Code:
# Application A (ETL Pipeline)
schema = store.get_schema(schema_id)
UserModel = from_dict(schema, "User")
# ... validate ETL data
# Application B (API Service)
schema = store.get_schema(schema_id) # Same schema
UserModel = from_dict(schema, "User")
# ... validate API requests
# Application C (Data Quality Tool)
schema = store.get_schema(schema_id) # Same schema
UserModel = from_dict(schema, "User")
# ... validate data quality checks
Scenario 3: Schema Evolution
Flow:
Old schema v1.0 → New schema v1.1 → Both stored in database →
Apps can query by version → Gradual migration possible →
Old apps use v1.0, new apps use v1.1
Code:
# Store multiple versions
schema_id_v1 = store.store_schema("user", schema_v1, version="1.0.0")
schema_id_v2 = store.store_schema("user", schema_v2, version="1.1.0")
# Legacy application uses old version
legacy_schema = store.get_schema(schema_id_v1)
LegacyUserModel = from_dict(legacy_schema, "User")
# New application uses new version
new_schema = store.get_schema(schema_id_v2)
NewUserModel = from_dict(new_schema, "User")
Best Practices
1. Contract Design
- Keep contracts versioned and documented
- Separate business concerns from technical concerns
- Use descriptive field names and descriptions
- Include examples in metadata
2. Storage Strategy
- Choose the right metadata store for your use case:
  - PostgreSQL: ACID transactions, complex queries, production systems
  - MongoDB: Flexible schema, document storage, rapid development
  - Redis: High-throughput, caching, ephemeral data
  - InMemory: Testing, development, prototyping
- Store all contract components (schema, metadata with ownership/governance, coercion rules, validation rules, ontology when using wiki-enabled store)
- Merge ownership and governance into metadata before storing (they are part of metadata, not separate)
- Use proper versioning (semantic versioning recommended)
- For PostgreSQL: Initialize schema with pycharter db init and use migrations
- Implement proper indexing for fast retrieval
- Consider retention policies for old versions
3. Model Generation
- Generate models on-demand rather than at startup
- Cache generated models when appropriate
- Handle schema changes gracefully
- Log model generation for debugging
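One way to cache generated models is to memoize the builder on schema identifier and version. The sketch below uses `functools.lru_cache`; `get_cached_model` is a hypothetical wrapper, and the `type(...)` call is only a placeholder for real model generation such as fetching the schema and building a model from it.

```python
from functools import lru_cache

build_calls = 0  # counts how often the expensive build actually runs


@lru_cache(maxsize=128)
def get_cached_model(schema_id: str, version: str) -> type:
    """Build (or reuse) a model class for one schema version."""
    global build_calls
    build_calls += 1
    # Placeholder for real generation, e.g. fetching the schema and building a model.
    return type(f"Model_{schema_id}_{version.replace('.', '_')}", (), {})


first = get_cached_model("user", "1.0.0")
second = get_cached_model("user", "1.0.0")  # served from the cache
other = get_cached_model("user", "1.1.0")   # new version, new build

print(first is second, build_calls)  # True 2
```

Keying the cache on an explicit version means a "latest" lookup should resolve the version number first; otherwise the cache would keep serving the model built from an older schema.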
4. Validation Strategy
- Validate early in your pipeline
- Use batch validation for large datasets
- Log validation errors for monitoring
- Send invalid data to dead letter queues
- Provide clear error messages to users
5. Monitoring & Observability
- Track validation success/failure rates
- Monitor schema version usage
- Alert on high validation failure rates
- Track contract evolution over time
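Tracking failure rates and alerting on them can start very small. The following sketch keeps the most recent validation outcomes in a sliding window and reports when the failure rate crosses a threshold. `FailureRateMonitor` is a hypothetical helper for illustration; the window size and threshold are arbitrary example values, and a production setup would more likely export these counts to a metrics system.

```python
from collections import deque


class FailureRateMonitor:
    """Track validation outcomes over a sliding window (hypothetical helper)."""

    def __init__(self, window: int = 100, alert_threshold: float = 0.05) -> None:
        self._outcomes: deque[bool] = deque(maxlen=window)
        self.alert_threshold = alert_threshold

    def record(self, is_valid: bool) -> None:
        """Add one outcome; the oldest is dropped once the window is full."""
        self._outcomes.append(is_valid)

    @property
    def failure_rate(self) -> float:
        if not self._outcomes:
            return 0.0
        return self._outcomes.count(False) / len(self._outcomes)

    def should_alert(self) -> bool:
        return self.failure_rate > self.alert_threshold


monitor = FailureRateMonitor(window=10, alert_threshold=0.2)
for outcome in [True, True, False, True, False, False, True, True, True, True]:
    monitor.record(outcome)

print(monitor.failure_rate, monitor.should_alert())  # 0.3 True
```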
Additional Services
Contract Builder (pycharter.contract_builder)
Purpose: Reconstructs complete contracts from separate artifacts stored in metadata store.
When to Use: When you need to rebuild a complete contract from stored components (schema, metadata, rules) for runtime validation or contract export.
Example:
from pycharter import build_contract_from_store

# Rebuild complete contract from store
contract = build_contract_from_store(
    store=store,
    schema_title=schema_id,
    schema_version=None,  # Use latest version
    include_metadata=True,
    include_ownership=True,
    include_governance=True
)
# Contract contains: schema (raw), coercion_rules and validation_rules (separate),
# metadata, ownership, governance_rules, versions
REST API (api/)
Purpose: HTTP endpoints for all PyCharter services using FastAPI.
Location: Root-level api/ directory (separate from pycharter/ package)
When to Use: When you need to access PyCharter services via HTTP, integrate with other services, or provide a web interface.
Installation:
Running:
Documentation:
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
See api/README.md for complete API documentation.
Summary
The PyCharter data journey provides a complete solution for managing data contracts from specification to runtime validation and quality assurance:
- Contract Creation: Business and developers collaborate to define data contracts
- Contract Parsing: Decompose contracts into structured components
- Metadata Storage: Store components in metadata store (PostgreSQL, MongoDB, Redis, or InMemory) with versioning
- Contract Building: Reconstruct complete contracts from stored components
- Model Generation: Dynamically generate Pydantic models from schemas
- Runtime Validation: Validate data against contracts in production
- Quality Assurance: Monitor data quality, track violations, and profile data characteristics
This journey ensures:
- ✅ Single source of truth (contract files)
- ✅ Separation of concerns (business vs. technical)
- ✅ Versioning and evolution tracking
- ✅ Runtime flexibility and type safety
- ✅ Early error detection and prevention
- ✅ Consistency across multiple applications
- ✅ Multiple storage backends (PostgreSQL, MongoDB, Redis, InMemory)
- ✅ REST API for service integration
- ✅ Data quality monitoring and violation tracking
- ✅ Historical quality metrics and trend analysis
- ✅ Data profiling for understanding data characteristics
By following this journey, you can maintain data quality, enforce contracts, and ensure consistency across your entire data infrastructure.
Separated Workflow: Schema, Metadata, and Rules Stored Separately
The separated workflow is an improved approach where schemas, metadata, and coercion/validation rules are stored and managed separately, enabling better collaboration between business units and developers.
Overview
The separated workflow addresses the need for:
1. Business units to define metadata (ownership, governance, versioning) independently
2. Developers to define schemas (as Pydantic models) independently
3. Both to collaborate on coercion and validation rules
4. Runtime to retrieve and combine all components automatically
The Separated Workflow
┌─────────────────────────────────────────────────────────────┐
│ Step 1: Business Unit Provides Metadata │
│ - Ownership information │
│ - Governance rules │
│ - Version information │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ Step 2: Developer Writes Pydantic Model │
│ - Define data types and structure │
│ - Convert to JSON Schema │
│ - Store schema separately │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ Step 3: Developer + Business Define Rules │
│ - Coercion rules (data transformation) │
│ - Validation rules (business + technical checks) │
│ - Store rules separately │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ Step 4: Store All Components Separately │
│ - Schema stored in database │
│ - Metadata stored in database │
│ - Coercion rules stored in database │
│ - Validation rules stored in database │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ Step 5: Runtime Validation │
│ - Retrieve schema from store │
│ - Retrieve coercion rules from store │
│ - Retrieve validation rules from store │
│ - Merge all components │
│ - Generate Pydantic model │
│ - Validate data │
└─────────────────────────────────────────────────────────────┘
Step-by-Step: Separated Workflow
Step 1: Business Unit Provides Metadata
business_metadata = {
    "owner": "data-team",
    "team": "engineering",
    "contact": "data-team@example.com",
    "description": "User data contract for authentication",
    "governance_rules": {
        "data_retention": {"days": 365},
        "pii_fields": {"fields": ["email", "user_id"]},
    },
    "version": "1.0.0",
}
Step 2: Developer Writes Pydantic Model and Converts to JSON Schema
from pydantic import BaseModel, Field
from pycharter import to_dict

# Developer writes Pydantic model
class User(BaseModel):
    user_id: str = Field(..., description="Unique identifier")
    username: str = Field(..., min_length=3, max_length=20)
    email: str = Field(..., description="User's email address")
    age: int = Field(..., ge=0, le=150)

# Convert to JSON Schema
schema = to_dict(User)
Step 3: Developer + Business Define Coercion and Validation Rules
# Coercion rules (data transformation before validation)
coercion_rules = {
    "user_id": "coerce_to_string",
    "age": "coerce_to_integer",
}

# Validation rules (additional checks after validation)
validation_rules = {
    "username": {
        "no_capital_characters": None,   # Business requirement
        "min_length": {"threshold": 3},  # Developer constraint
    },
    "age": {
        "is_positive": {"threshold": 0},  # Business requirement
    },
}
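To show how named coercion rules like these could be applied before validation, here is a minimal sketch of a name-to-function registry. It is an assumption about the general mechanism, not PyCharter's implementation: `COERCIONS` and `apply_coercions` are hypothetical, and only the two rule names used above are mapped.

```python
from typing import Any, Callable

# Hypothetical registry: rule name -> function. Only the two names used above are covered.
COERCIONS: dict[str, Callable[[Any], Any]] = {
    "coerce_to_string": str,
    "coerce_to_integer": int,
}


def apply_coercions(record: dict[str, Any], rules: dict[str, str]) -> dict[str, Any]:
    """Return a copy of the record with each listed field passed through its coercion."""
    coerced = dict(record)
    for field_name, rule_name in rules.items():
        if field_name in coerced and coerced[field_name] is not None:
            coerced[field_name] = COERCIONS[rule_name](coerced[field_name])
    return coerced


coercion_rules = {"user_id": "coerce_to_string", "age": "coerce_to_integer"}
raw = {"user_id": 12345, "age": "30", "username": "alice"}

print(apply_coercions(raw, coercion_rules))
# {'user_id': '12345', 'age': 30, 'username': 'alice'}
```

Storing rule names rather than functions is what lets the rules live in a database and be versioned separately from the schema.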
Step 4: Store All Components Separately¶
from pycharter import InMemoryMetadataStore, PostgresMetadataStore, MongoDBMetadataStore
# Choose your store implementation
store = InMemoryMetadataStore() # or PostgresMetadataStore, MongoDBMetadataStore, etc.
store.connect()
# Store schema (from developer)
schema_id = store.store_schema("user", schema, version="1.0.0")
# Merge ownership and governance into metadata before storing
# Ownership and governance are part of metadata, not separate entities
metadata_dict = business_metadata.copy() if business_metadata else {}
# Record the owner from Step 1 under "business_owners"
owner = metadata_dict.get("owner")
if owner:
    metadata_dict["business_owners"] = [owner]
# "governance_rules" is already a key of business_metadata, so it carries over unchanged
# Store metadata once with all information (ownership and governance included)
if metadata_dict:
    store.store_metadata(
        resource_id=schema_id,
        resource_type="schema",
        metadata=metadata_dict,
    )
# Store coercion rules (from developer + business)
if coercion_rules:
    store.store_coercion_rules(schema_id, coercion_rules, version="1.0.0")
# Store validation rules (from developer + business)
if validation_rules:
    store.store_validation_rules(schema_id, validation_rules, version="1.0.0")
Step 5: Runtime Validation¶
from pycharter import validate, validate_with_store, get_model_from_store
# Example record to validate (illustrative values)
incoming_data = {"user_id": "u-123", "username": "alice", "email": "alice@example.com", "age": 30}
# Option 1: Use convenience function (retrieves all and validates)
result = validate_with_store(
    store=store,
    schema_id=schema_id,
    data=incoming_data,
    strict=False,
)
# Option 2: Get model once and reuse
UserModel = get_model_from_store(store, schema_id, "User")
result = validate(UserModel, incoming_data)
Benefits of Separated Workflow¶
- Clear Separation of Concerns: Business owns metadata, developer owns schema, both collaborate on rules
- Independent Versioning: Each component can be versioned independently
- Independent Updates: Update components without affecting others
- Better Collaboration: Business and developer can work in parallel
- Runtime Flexibility: Retrieve and combine components on-demand
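Independent versioning can be illustrated with a toy store in which each component kind keeps its own version history. This is a sketch under stated assumptions, not PyCharter's storage model: `ToyComponentStore` is hypothetical, and treating `version=None` as "latest by numeric version order" is an assumption made for the example.

```python
def _version_key(version):
    """Order dotted versions numerically, so "1.10.0" sorts after "1.9.0"."""
    return tuple(int(part) for part in version.split("."))

class ToyComponentStore:
    """Keeps a separate version history per (component kind, schema id)."""

    def __init__(self):
        self._items = {}  # (kind, schema_id) -> {version: payload}

    def put(self, kind, schema_id, payload, version):
        self._items.setdefault((kind, schema_id), {})[version] = payload

    def get(self, kind, schema_id, version=None):
        versions = self._items.get((kind, schema_id), {})
        if not versions:
            return None
        if version is None:  # assumed meaning: newest stored version
            version = max(versions, key=_version_key)
        return versions.get(version)

store = ToyComponentStore()
store.put("schema", "user", {"type": "object"}, version="1.0.0")
store.put("validation_rules", "user", {"age": {"is_positive": {"threshold": 0}}}, version="1.9.0")
store.put("validation_rules", "user", {"age": {"is_positive": {"threshold": 17}}}, version="1.10.0")

latest_rules = store.get("validation_rules", "user")
pinned_rules = store.get("validation_rules", "user", version="1.9.0")
```

The validation rules advance to 1.10.0 while the schema stays at 1.0.0, which is the kind of decoupled evolution the separated workflow is meant to allow.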
API Reference: Separated Workflow¶
Store Components Separately¶
# Store schema (in schemas table)
schema_id = store.store_schema(schema_name, schema, version=None)
# Store metadata (in metadata_records table)
metadata_record_id = store.store_metadata(resource_id, metadata, resource_type="schema")
# Store coercion rules (in coercion_rules table)
coercion_rules_id = store.store_coercion_rules(schema_id, coercion_rules, version=None)
# Store validation rules (in validation_rules table)
validation_rules_id = store.store_validation_rules(schema_id, validation_rules, version=None)
Retrieve Components¶
# Get schema only (from schemas table)
schema = store.get_schema(schema_id)
# Get coercion rules only (from coercion_rules table)
coercion_rules = store.get_coercion_rules(schema_id, version=None)
# Get validation rules only (from validation_rules table)
validation_rules = store.get_validation_rules(schema_id, version=None)
# Get metadata record (from metadata_records table)
metadata_record = store.get_metadata(resource_id, resource_type="schema")
# Get complete schema (with rules merged - for display/docs, not for editing)
# Note: For validation, prefer the Validator class, which handles merging internally
complete_schema = store.get_complete_schema(schema_id, version=None)
# Ontology (wiki-enabled stores only: PostgresMetadataStore, InMemoryMetadataStore)
# store.store_ontology(contract_name, contract_version, ontology_dict)
# ontology = store.get_ontology(contract_name, contract_version)
Runtime Validation Functions¶
# Validate with store (retrieves all components automatically)
result = validate_with_store(store, schema_id, data, version=None, strict=False)
# Validate batch with store
results = validate_batch_with_store(store, schema_id, data_list, version=None, strict=False)
# Get model from store (for multiple validations)
Model = get_model_from_store(store, schema_id, model_name=None, version=None)
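Batch validation usually ends with a summary step. The sketch below assumes that a batch call yields one result per record and that each result exposes `is_valid` and `errors`; both attribute names are assumptions here. `ToyResult`, `validate_record`, and `summarize` are hypothetical stand-ins so the example runs without a store.

```python
from dataclasses import dataclass, field

@dataclass
class ToyResult:
    """Hypothetical stand-in for a per-record validation result."""
    is_valid: bool
    errors: list = field(default_factory=list)

def validate_record(record):
    """Toy validator: 'age' must be an int between 0 and 150."""
    age = record.get("age")
    if isinstance(age, int) and not isinstance(age, bool) and 0 <= age <= 150:
        return ToyResult(True)
    return ToyResult(False, [f"age: invalid value {age!r}"])

def summarize(results):
    """Report totals and the positions of the records that failed."""
    failed = [i for i, r in enumerate(results) if not r.is_valid]
    return {
        "total": len(results),
        "valid": len(results) - len(failed),
        "failed_indexes": failed,
    }

records = [{"age": 30}, {"age": -1}, {"age": "x"}]
results = [validate_record(r) for r in records]
summary = summarize(results)
```

Keeping the failing indexes rather than only a count makes it possible to route the rejected records to a quarantine table or a retry queue.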
Comparison: Combined vs Separated Workflow¶
| Aspect | Combined Workflow | Separated Workflow |
|---|---|---|
| Contract File | Single file with all components | Components stored separately |
| Business Input | Part of contract file | Separate metadata |
| Developer Input | Part of contract file | Separate schema |
| Rules | Embedded in schema | Stored separately |
| Versioning | Single version for all | Independent versions |
| Updates | Update entire contract | Update components independently |
| Collaboration | Requires coordination | Parallel work possible |
| Runtime | Parse contract → generate model | Retrieve → merge → generate model |
When to Use Separated Workflow¶
Use the separated workflow when:
- Business and developer teams work independently
- You need independent versioning of components
- Schemas change frequently but metadata is stable (or vice versa)
- You want maximum flexibility in runtime validation
- You're building a large-scale data contract management system
See the Example Notebooks for workflow examples.