The Complete Data Journey: From Contract to Runtime Validation

This document describes the complete data production journey using PyCharter, from initial contract specification (collaboration between business units and developers) to runtime validation in production systems.

Overview: The 6-Stage Journey

Business Unit (Data Definitions & Governance)
Developer (Data Types & Technical Constraints)
Contract File (YAML/JSON) - Single Source of Truth
[PyCharter Services: Parse → Store → Build → Generate → Validate]
Runtime Validation (Production Systems)
(Optional) REST API for Service Integration

PyCharter Core Services:

  1. Contract Parser: Decomposes contract files into components
  2. Metadata Store: Stores components (PostgreSQL, MongoDB, Redis, or InMemory)
  3. Contract Builder: Reconstructs complete contracts from stored components
  4. Pydantic Generator: Generates Pydantic models from JSON Schemas
  5. JSON Schema Converter: Converts Pydantic models to JSON Schemas
  6. Runtime Validator: Validates data against contracts
  7. Quality Assurance: Data quality checks, metrics, violation tracking, and profiling
  8. REST API (optional): HTTP endpoints for all services


Stage 1: Contract Creation (Business + Developer Collaboration)

Participants:

  • Business Unit: Defines data definitions, governance rules, ownership, and business requirements
  • Developer: Adds technical constraints, data types, formats, and validation rules

What Happens:

  • Business stakeholders define what data means, who owns it, and governance policies
  • Developers add technical specifications: data types, formats (UUID, email, date-time), validation constraints (minLength, maxLength, pattern), and technical requirements
  • Together they create a data contract file (YAML or JSON) that serves as the single source of truth

Contract File Structure (data/examples/book/book_contract.yaml):

schema:                    # Developer: Technical schema definition
  type: object
  properties:
    user_id:
      type: string
      format: uuid         # Developer: Technical constraint
      description: Unique identifier for the user
    username:
      type: string
      minLength: 3         # Developer: Technical validation
      maxLength: 20
      pattern: "^[a-z0-9_]+$"
      description: Username (lowercase alphanumeric and underscores only)
    email:
      type: string
      format: email        # Developer: Technical format
      description: User's email address
    age:
      type: integer
      minimum: 0           # Developer: Business rule enforcement
      maximum: 150
      description: User's age in years
    created_at:
      type: string
      format: date-time
      description: Account creation timestamp
  required:
    - user_id
    - username
    - email
    - created_at

governance_rules:          # Business: Data governance policies
  data_retention:
    days: 365              # Business: Retention policy
    description: User data should be retained for 365 days
  pii_fields:              # Business: Privacy requirements
    fields:
      - email
      - user_id
    description: Fields containing personally identifiable information
  access_control:
    level: restricted
    description: User data requires restricted access

ownership:                 # Business: Ownership information
  owner: data-team
  team: engineering
  contact: data-team@example.com
  description: Data team owns user data contracts

metadata:                  # Both: Versioning and documentation
  version: "1.0.0"
  description: User data contract for authentication and profile management
  created: "2024-01-01"
  last_updated: "2024-01-15"

# Optional: ontology (semantic field annotations; see Wiki / Concepts)
# ontology:
#   version: "1.0.0"
#   fields:
#     email: { concept: user_email, definition: "Primary email address" }

Key Points:

  • Contract file is the single source of truth for data structure and rules
  • Separates business concerns (governance, ownership) from technical concerns (types, constraints)
  • Versioned for tracking changes over time
  • Human-readable format (YAML or JSON) for collaboration

Output: A contract file that combines business requirements with technical specifications
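To make the technical constraints concrete, the sketch below re-expresses the `username` and `age` rules from the contract in plain Python. It uses only the standard library and is independent of PyCharter; the function names `username_ok` and `age_ok` are hypothetical and exist only for illustration.

```python
import re

# Constraints copied from the `username` and `age` properties of the contract.
USERNAME_PATTERN = re.compile(r"^[a-z0-9_]+$")
USERNAME_MIN, USERNAME_MAX = 3, 20
AGE_MIN, AGE_MAX = 0, 150


def username_ok(value: str) -> bool:
    """Return True when the value meets minLength, maxLength, and pattern."""
    return (
        USERNAME_MIN <= len(value) <= USERNAME_MAX
        and USERNAME_PATTERN.fullmatch(value) is not None
    )


def age_ok(value: int) -> bool:
    """Return True when the value lies within [minimum, maximum]."""
    return AGE_MIN <= value <= AGE_MAX
```

With these rules, `"alice_01"` is accepted, while `"ab"` (too short) and `"Alice"` (capital letter) are rejected.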


Stage 2: Contract Parsing (Developer)

Service: pycharter.contract_parser

Who: Developer (automated process, typically in CI/CD or setup scripts)

What Happens:

  • Contract file is parsed and decomposed into structured components
  • Separates schema, governance_rules, ownership, metadata, and optional ontology into distinct objects
  • Returns a ContractMetadata object that makes each component accessible independently

Code Example:

from pycharter import parse_contract, parse_contract_file, ContractMetadata

# Parse the contract file (YAML or JSON)
metadata = parse_contract_file("data/examples/book/book_contract.yaml")

# Access decomposed components
schema = metadata.schema              # JSON Schema for model generation
governance = metadata.governance_rules # For governance enforcement
ownership = metadata.ownership         # For access control
metadata_info = metadata.metadata      # Version, description, etc.
ontology = metadata.ontology          # Optional semantic field annotations

# Or parse from a dictionary
contract_dict = {
    "schema": {...},
    "governance_rules": {...},
    "ownership": {...},
    "metadata": {...}
}
metadata = parse_contract(contract_dict)

Why This Matters:

  • Separates concerns: schema can be used independently for model generation
  • Governance rules can be enforced separately
  • Ownership information can be used for access control
  • Metadata can be used for versioning and documentation

Output: Structured ContractMetadata object with separated components
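The decomposition idea can be sketched in a few lines of standard-library Python. This is not PyCharter's implementation: `ContractParts` and `split_contract` are hypothetical names, and the real ContractMetadata object may carry more fields (for example versions or extracted rules).

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class ContractParts:
    """Hypothetical stand-in for a parsed contract; not PyCharter's ContractMetadata."""
    schema: Dict[str, Any]
    governance_rules: Dict[str, Any] = field(default_factory=dict)
    ownership: Dict[str, Any] = field(default_factory=dict)
    metadata: Dict[str, Any] = field(default_factory=dict)
    ontology: Optional[Dict[str, Any]] = None


def split_contract(contract: Dict[str, Any]) -> ContractParts:
    """Split a contract dictionary into its top-level sections."""
    if "schema" not in contract:
        raise ValueError("contract must contain a 'schema' section")
    return ContractParts(
        schema=contract["schema"],
        governance_rules=contract.get("governance_rules", {}),
        ownership=contract.get("ownership", {}),
        metadata=contract.get("metadata", {}),
        ontology=contract.get("ontology"),
    )
```

Each section can then be handed to a different consumer: the schema to model generation, ownership to access control, and so on.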


Stage 3: Metadata Storage (Developer)

Service: pycharter.metadata_store

Who: Developer (one-time setup, then automated)

What Happens:

  • Decomposed metadata components are stored in a metadata store (database or in-memory)
  • Multiple store implementations available: PostgreSQL, MongoDB, Redis, or InMemory
  • Schemas are versioned for evolution tracking
  • Governance rules and ownership are stored as part of metadata (not separate entities)
  • Enables querying, versioning, and retrieval of stored metadata

Available Metadata Store Implementations:

  1. PostgresMetadataStore: PostgreSQL database (requires psycopg2-binary)
     • Requires schema initialization: pycharter db init <connection_string>
     • Supports migrations: pycharter db upgrade
     • Optional seeding: pycharter db seed <seed_dir> <connection_string>

  2. MongoDBMetadataStore: MongoDB database (requires pymongo)
     • Auto-creates indexes on first connection
     • No separate initialization needed

  3. InMemoryMetadataStore: In-memory storage (no dependencies)
     • Perfect for testing and development
     • Data is lost when process ends

  4. RedisMetadataStore: Redis database (requires redis)
     • Fast key-value storage
     • Good for caching and high-throughput scenarios

Code Example:

from pycharter import PostgresMetadataStore, MongoDBMetadataStore, InMemoryMetadataStore

# Option 1: PostgreSQL (requires: pycharter db init first)
store = PostgresMetadataStore(connection_string="postgresql://user:pass@localhost/db")
store.connect()

# Option 2: MongoDB (auto-initializes)
store = MongoDBMetadataStore(
    connection_string="mongodb://user:pass@localhost:27017/",
    database_name="pycharter"
)
store.connect()

# Option 3: InMemory (for testing)
store = InMemoryMetadataStore()
store.connect()

# Store schema with versioning
schema_id = store.store_schema(
    schema_name="user",
    schema=metadata.schema,
    version=metadata.versions.get("schema", "1.0.0")
)

# Merge ownership and governance rules into metadata before storing
# Ownership and governance are part of metadata, not separate entities
metadata_dict = metadata.metadata.copy() if metadata.metadata else {}
if metadata.ownership:
    owner = metadata.ownership.get("owner")
    metadata_dict["business_owners"] = [owner] if owner else []
if metadata.governance_rules:
    metadata_dict["governance_rules"] = metadata.governance_rules

# Store metadata once with all information (ownership and governance included)
if metadata_dict:
    store.store_metadata(
        resource_id=schema_id,
        resource_type="schema",
        metadata=metadata_dict
    )

# Store coercion rules (if extracted from schema)
if metadata.coercion_rules:
    store.store_coercion_rules(
        schema_id=schema_id,
        coercion_rules=metadata.coercion_rules,
        version=metadata.versions.get("coercion_rules", "1.0.0")
    )

# Store validation rules (if extracted from schema)
if metadata.validation_rules:
    store.store_validation_rules(
        schema_id=schema_id,
        validation_rules=metadata.validation_rules,
        version=metadata.versions.get("validation_rules", "1.0.0")
    )

store.disconnect()

Database Initialization (PostgreSQL only):

# Initialize database schema
pycharter db init postgresql://user:pass@localhost/pycharter

# Apply migrations
pycharter db upgrade postgresql://user:pass@localhost/pycharter

# (Optional) Seed initial data (owners, domains, systems)
pycharter db seed data/seed postgresql://user:pass@localhost/pycharter

Why This Matters:

  • Versioning: Track schema evolution over time
  • Queryability: Find schemas by name, version, owner, etc.
  • Audit Trail: Know who owns what and when it changed
  • Centralized Storage: Single source of truth in database
  • Multi-Application: Multiple applications can retrieve same schemas
  • Flexibility: Choose the right store for your use case (PostgreSQL for ACID, MongoDB for flexibility, InMemory for testing)

Output: All metadata stored in database, versioned and queryable
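The core idea behind versioned storage, keeping every (name, version) pair side by side rather than overwriting, can be sketched with a dictionary. `TinyVersionedStore` below is a hypothetical teaching aid and does not reflect how PyCharter's stores are implemented or how they format schema IDs.

```python
from typing import Any, Dict, List, Optional, Tuple


class TinyVersionedStore:
    """Hypothetical in-memory store keyed by (name, version); illustration only."""

    def __init__(self) -> None:
        self._schemas: Dict[Tuple[str, str], Dict[str, Any]] = {}

    def store_schema(self, name: str, schema: Dict[str, Any], version: str) -> str:
        """Store a schema under (name, version) and return an identifier."""
        key = (name, version)
        if key in self._schemas:
            # Refuse to overwrite so that a published version never changes.
            raise ValueError(f"{name} {version} is already stored")
        self._schemas[key] = schema
        return f"{name}:{version}"

    def get_schema(self, name: str, version: str) -> Optional[Dict[str, Any]]:
        """Return the stored schema, or None when the pair is unknown."""
        return self._schemas.get((name, version))

    def versions(self, name: str) -> List[str]:
        """List the stored versions of a schema in insertion order."""
        return [v for (n, v) in self._schemas if n == name]
```

Rejecting duplicate (name, version) pairs is a design choice made here to keep versions immutable; a real store may handle this differently.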


Stage 4: Model Generation (Developer - On-Demand)

Service: pycharter.pydantic_generator and pycharter.contract_builder

Who: Developer (in application code, ETL scripts, APIs)

What Happens:

  • Retrieve schema from metadata store (or use directly from parsed contract)
  • Optionally rebuild complete contract from store using Contract Builder
  • Dynamically generate Pydantic model class at runtime
  • Model includes all validations, constraints, and types from the schema
  • Model is fully functional and can be used like any Pydantic model

Code Example:

import json

from pydantic import ValidationError
from pycharter import (
    from_dict,
    build_contract_from_store,
    get_model_from_contract,
    get_model_from_store,
)

# Option 1: Get schema from metadata store and generate model
store.connect()
schema = store.get_schema(schema_id)  # Retrieve stored schema
UserModel = from_dict(schema, "User")
store.disconnect()

# Option 2: Rebuild complete contract from store (includes metadata, rules, etc.)
store.connect()
reconstructed_contract = build_contract_from_store(
    store=store,
    schema_id=schema_id,
    version=None,  # Use latest version
    include_metadata=True,
    include_ownership=True,
    include_governance=True
)
# Contract has: schema (RAW), coercion_rules (separate), validation_rules (separate)
# Generate model - get_model_from_contract handles merging internally
UserModel = get_model_from_contract(
    contract=reconstructed_contract,
    model_name="User"
)
store.disconnect()

# Option 3: Use convenience function to get model directly from store
store.connect()
UserModel = get_model_from_store(
    store=store,
    schema_id=schema_id,
    model_name="User",
    version=None
)
store.disconnect()

# Option 4: Use schema directly from parsed contract
schema = metadata.schema
UserModel = from_dict(schema, "User")

# Option 5: Load from file (the context manager closes the file handle)
with open("schema.json") as f:
    schema = json.load(f)
UserModel = from_dict(schema, "User")

# Now you have a fully-functional Pydantic model!
# It includes all validations: UUID format, email format, minLength, etc.

# Use it like any Pydantic model
user = UserModel(
    user_id="123e4567-e89b-12d3-a456-426614174000",
    username="alice",
    email="alice@example.com",
    age=30,
    created_at="2024-01-15T10:30:00Z"
)

# Validation happens automatically
try:
    invalid_user = UserModel(username="ab", email="not-email", age=-5)
except ValidationError as e:
    print("Validation failed:", e)

Why This Matters:

  • Dynamic Generation: No need to write model classes manually
  • Always Up-to-Date: Models generated from latest schema version
  • Type Safety: Full Pydantic type checking and validation
  • Runtime Flexibility: Generate models on-demand when needed
  • Consistency: All applications use same schema definitions
  • Complete Contracts: Contract Builder consolidates all components; schema is raw and rules are separate; Validator merges rules internally during validation

Output: A Pydantic model class ready for validation
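Building a class from a schema at runtime can be illustrated without PyCharter or Pydantic. The sketch below uses only the standard library's `dataclasses.make_dataclass`; `class_from_schema` and `JSON_TO_PY` are hypothetical names. Unlike a generated Pydantic model, the resulting class mirrors field names and defaults but does not enforce types or constraints.

```python
from dataclasses import field, make_dataclass
from typing import Any, Dict, Optional

# Maps a JSON Schema "type" keyword to a Python type (small subset, for illustration).
# Assumes "type" is a single string, not a list of types.
JSON_TO_PY = {"string": str, "integer": int, "number": float, "boolean": bool}


def class_from_schema(schema: Dict[str, Any], name: str) -> type:
    """Create a dataclass whose fields mirror the schema's properties."""
    required = set(schema.get("required", []))
    required_fields, optional_fields = [], []
    for prop, spec in schema.get("properties", {}).items():
        py_type = JSON_TO_PY.get(spec.get("type"), Any)
        if prop in required:
            required_fields.append((prop, py_type))
        else:
            # Optional properties default to None.
            optional_fields.append((prop, Optional[py_type], field(default=None)))
    # Dataclasses need fields without defaults to come first.
    return make_dataclass(name, required_fields + optional_fields)
```

For a schema with a required `username` and an optional `age`, `class_from_schema(schema, "User")(username="alice")` yields an instance whose `age` is `None`.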


Stage 5: Runtime Validation (Developer - In Production)

Service: pycharter.runtime_validator

Who: Developer (in production code: ETL pipelines, APIs, data processing scripts)

What Happens:

  • Validate incoming data against the generated Pydantic model
  • Catch contract violations early before data enters your system
  • Return structured validation results with error details
  • Support both single record and batch validation

Code Example:

from pydantic import ValidationError  # Assumed to be the exception raised in strict mode
from pycharter import validate, validate_batch, ValidationResult

# Single record validation (e.g., API endpoint)
def process_api_request(incoming_data: dict):
    result: ValidationResult = validate(UserModel, incoming_data)

    if result.is_valid:
        # Data passes all validations from contract
        validated_user = result.data
        # Continue processing...
        save_to_database(validated_user)
        return {"status": "success", "user_id": validated_user.user_id}
    else:
        # Contract violations detected
        return {
            "status": "error",
            "errors": result.errors,
            "message": "Data does not conform to contract"
        }

# Batch validation (e.g., ETL pipeline processing CSV or database records)
def process_etl_batch(batch_of_records: list[dict]):
    results = validate_batch(UserModel, batch_of_records)

    valid_records = [r.data for r in results if r.is_valid]
    invalid_records = [
        {"data": batch_of_records[i], "errors": r.errors}
        for i, r in enumerate(results) if not r.is_valid
    ]

    # Process valid records
    save_to_database(valid_records)

    # Handle invalid records (log, send to DLQ, etc.)
    if invalid_records:
        log_validation_errors(invalid_records)
        send_to_dead_letter_queue(invalid_records)

    return {
        "processed": len(batch_of_records),
        "valid": len(valid_records),
        "invalid": len(invalid_records)
    }

# Strict mode (raises exceptions instead of returning results)
def strict_validation(data: dict):
    try:
        result = validate(UserModel, data, strict=True)
        return result.data
    except ValidationError as e:
        # Re-raise with context, keeping the original exception chained
        raise ValueError(f"Contract violation: {e}") from e

Why This Matters:

  • Early Detection: Catch data quality issues before they propagate
  • Contract Enforcement: Ensure all data conforms to business and technical rules
  • Error Reporting: Detailed error messages help identify issues
  • Production Safety: Prevent bad data from entering your systems
  • Batch Processing: Efficiently validate large datasets

Output: Validated data ready for processing, or error information for invalid data
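The split between valid records and rejected records does not depend on any particular validator. The sketch below shows that partitioning step with a pluggable check function; `partition_batch` and `check_age` are hypothetical names, and `check_age` is a stand-in for a real contract check, not PyCharter's validator.

```python
from typing import Any, Callable, Dict, List, Tuple

Record = Dict[str, Any]


def partition_batch(
    records: List[Record],
    check: Callable[[Record], List[str]],
) -> Tuple[List[Record], List[Dict[str, Any]]]:
    """Split records into valid ones and rejected ones with their errors.

    `check` returns a list of error messages; an empty list means valid.
    """
    valid, rejected = [], []
    for index, record in enumerate(records):
        errors = check(record)
        if errors:
            # Keep the position so the source row can be traced later.
            rejected.append({"index": index, "data": record, "errors": errors})
        else:
            valid.append(record)
    return valid, rejected


def check_age(record: Record) -> List[str]:
    """Hypothetical check: age must be an integer between 0 and 150."""
    age = record.get("age")
    if isinstance(age, int) and 0 <= age <= 150:
        return []
    return [f"age missing or out of range: {age!r}"]
```

Keeping the original index alongside each rejected record makes it easier to trace a dead-letter entry back to its source row.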


Stage 6: Quality Assurance (Developer - In Production)

Service: pycharter.quality

Who: Developer (in production code: ETL pipelines, data quality monitoring, scheduled jobs)

What Happens:

  • Run quality checks against data to measure data quality metrics
  • Track violations and quality scores over time
  • Profile data to understand its characteristics
  • Persist quality metrics and violations to database for historical tracking
  • Monitor data quality trends and alert on threshold breaches

Code Example:

from pycharter import QualityCheck, QualityCheckOptions, QualityThresholds, DataProfiler
from pycharter.db.models.base import get_session
from pycharter.config import get_database_url

# Initialize quality check with database session for persistence
db_url = get_database_url()
db_session = get_session(db_url)

check = QualityCheck(store=store, db_session=db_session)

# Run quality check with profiling enabled
report = check.run(
    schema_id="user_schema_v1",
    data="data/users.json",
    options=QualityCheckOptions(
        calculate_metrics=True,
        record_violations=True,
        include_profiling=True,  # Enable data profiling
        check_thresholds=True,
        thresholds=QualityThresholds(
            min_overall_score=95.0,
            max_violation_rate=0.05,
            min_completeness=0.95,
            min_accuracy=0.95
        )
    )
)

# Access quality metrics
print(f"Overall Score: {report.quality_score.overall_score:.2f}/100")
print(f"Accuracy: {report.quality_score.accuracy:.2%}")
print(f"Completeness: {report.quality_score.completeness:.2%}")
print(f"Violation Rate: {report.quality_score.violation_rate:.2%}")

# Access profiling data
if "profiling" in report.metadata:
    profile = report.metadata["profiling"]
    print(f"Record Count: {profile['record_count']}")
    print(f"Average Completeness: {profile['overall_stats']['average_completeness']:.2%}")

# Check threshold breaches
if report.threshold_breaches:
    print("⚠ Threshold Breaches:")
    for breach in report.threshold_breaches:
        print(f"  - {breach}")

# Query violations from database
violations = check.violation_tracker.get_violations(
    schema_id="user_schema_v1",
    status="open",
    severity="critical"
)

# Get violation summary
summary = check.violation_tracker.get_violation_summary(schema_id="user_schema_v1")
print(f"Total Violations: {summary['total']}")
print(f"Open: {summary['open']}, Resolved: {summary['resolved']}")

# Standalone data profiling
profiler = DataProfiler()
profile = profiler.profile(data_list)
print(f"Field Profiles: {profile['field_profiles']}")

Quality Metrics:

  • Overall Score: Composite quality score (0-100)
  • Accuracy: Percentage of valid records
  • Completeness: Percentage of required fields present
  • Violation Rate: Percentage of records with violations
  • Field-Level Metrics: Per-field completeness and violation rates
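The exact formulas PyCharter uses are not stated here, so the sketch below shows one plausible reading of two of these metrics in plain Python. The function names and the choice to treat `None` as "missing" are assumptions made for illustration.

```python
from typing import Any, Dict, List


def completeness(records: List[Dict[str, Any]], required: List[str]) -> float:
    """Fraction of required field slots that hold a non-None value."""
    total = len(records) * len(required)
    if total == 0:
        return 1.0  # Nothing was required, so nothing is missing.
    present = sum(
        1 for record in records for name in required if record.get(name) is not None
    )
    return present / total


def violation_rate(error_counts: List[int]) -> float:
    """Fraction of records with at least one violation.

    `error_counts` holds the number of violations found per record.
    """
    if not error_counts:
        return 0.0
    return sum(1 for count in error_counts if count > 0) / len(error_counts)
```

Both functions return fractions in the range 0 to 1, which matches the scale of thresholds such as `min_completeness=0.95` and `max_violation_rate=0.05`.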

Violation Tracking:

  • Individual violations are recorded with details (field, error type, error message)
  • Violations can be filtered by schema, status, severity, date range
  • Violations can be resolved and tracked over time
  • All violations are persisted to database when database session is provided

Data Profiling:

  • Statistical analysis of data characteristics
  • Per-field profiling: null counts, data types, unique values, distributions
  • Type-specific statistics: numeric (min/max/mean/median/std_dev), string (length stats), boolean (distribution)
  • Most common values and patterns
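A per-field profile of this kind can be approximated with the standard library. The sketch below is not DataProfiler's implementation; `profile_field` and the keys of its result are hypothetical, and it assumes field values are hashable scalars.

```python
import statistics
from collections import Counter
from typing import Any, Dict, List


def profile_field(records: List[Dict[str, Any]], name: str) -> Dict[str, Any]:
    """Summarize one field: null count, distinct values, most common value."""
    values = [record.get(name) for record in records]
    non_null = [v for v in values if v is not None]
    summary: Dict[str, Any] = {
        "null_count": len(values) - len(non_null),
        "unique_count": len(set(non_null)),
        "most_common": Counter(non_null).most_common(1)[0][0] if non_null else None,
    }
    # Numeric statistics apply only to int/float values (bool is excluded).
    numeric = [
        v for v in non_null
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    if numeric:
        summary.update(
            min=min(numeric),
            max=max(numeric),
            mean=statistics.mean(numeric),
            median=statistics.median(numeric),
        )
    return summary
```

Running it once per field name gives a dictionary of field profiles that can be compared between batches to spot drift.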

Database Persistence:

  • Quality metrics are automatically saved to quality_metrics table
  • Violations are automatically saved to quality_violations table
  • Historical tracking enables trend analysis and monitoring
  • All metrics include timestamps for time-series analysis

Why This Matters:

  • Proactive Monitoring: Detect data quality issues before they impact downstream systems
  • Historical Tracking: Track quality trends over time
  • Violation Management: Track and resolve data quality issues systematically
  • Data Understanding: Profiling helps understand data characteristics and patterns
  • Threshold Alerting: Automatic alerts when quality falls below thresholds
  • Production Safety: Ensure data quality meets business requirements

Output: Quality report with metrics, violations, and profiling data (optionally persisted to database)


Complete End-to-End Flow

Developer Workflow Example

Here's a complete example showing all stages working together:

from pycharter import (
    parse_contract_file,
    PostgresMetadataStore,
    from_dict,
    validate,
    ValidationResult,
)

# ============================================================
# SETUP PHASE (One-time or when contract changes)
# ============================================================

# Step 1: Parse contract (from business + developer collaboration)
metadata = parse_contract_file("data/examples/book/book_contract.yaml")

# Step 2: Store in database
# First, initialize database (PostgreSQL only)
# Run: pycharter db init postgresql://user:pass@localhost/pycharter

store = PostgresMetadataStore(connection_string="postgresql://...")
store.connect()

schema_id = store.store_schema(
    schema_name="user",
    schema=metadata.schema,
    version=metadata.versions.get("schema", "1.0.0")
)

# Merge ownership and governance rules into metadata before storing
# Ownership and governance are part of metadata, not separate entities
metadata_dict = metadata.metadata.copy() if metadata.metadata else {}
if metadata.ownership:
    owner = metadata.ownership.get("owner")
    metadata_dict["business_owners"] = [owner] if owner else []
if metadata.governance_rules:
    metadata_dict["governance_rules"] = metadata.governance_rules

# Store metadata once with all information (ownership and governance included)
if metadata_dict:
    store.store_metadata(
        resource_id=schema_id,
        resource_type="schema",
        metadata=metadata_dict
    )

store.disconnect()

# ============================================================
# RUNTIME PHASE (In production code)
# ============================================================

# Step 3: Retrieve schema and generate model (on-demand)
store.connect()
schema = store.get_schema(schema_id)  # Get latest version
UserModel = from_dict(schema, "User")  # Generate model
store.disconnect()

# Step 4: Validate incoming data (in your ETL/API/processing code)
def process_incoming_user_data(raw_data: dict):
    result: ValidationResult = validate(UserModel, raw_data)

    if result.is_valid:
        # Data passes all validations from contract
        validated_user = result.data
        # Continue processing...
        return validated_user
    else:
        # Contract violations detected
        raise ValueError(f"Data contract violation: {result.errors}")

Key Benefits of This Journey

1. Single Source of Truth

  • Contract file defines everything: schema, governance, ownership, metadata
  • No duplication or drift between definitions
  • Changes propagate automatically through the system

2. Separation of Concerns

  • Business: Defines governance rules, ownership, retention policies
  • Developer: Defines technical constraints, types, formats, validations
  • Both collaborate in the same contract file but maintain their domains

3. Versioning & Evolution

  • Track schema evolution over time in database
  • Support multiple versions simultaneously
  • Enable gradual migration and rollback

4. Runtime Flexibility

  • Generate models on-demand from stored schemas
  • No need to redeploy code when schemas change
  • Applications automatically use latest schema versions

5. Type Safety & Validation

  • Pydantic models provide full type checking
  • All contract validations enforced automatically
  • Catch errors early in the pipeline

6. Early Error Detection

  • Validate data before it enters your systems
  • Prevent bad data from propagating
  • Detailed error messages help identify issues

7. Multi-Application Support

  • One contract → Stored once → Multiple apps retrieve
  • All applications validate against same contract
  • Ensures consistency across the organization

Real-World Scenarios

Scenario 1: New Contract Version

Flow:

Business updates contract → Developer parses new version → 
Store new version in database → Applications automatically retrieve new version → 
Validation uses updated rules

Code:

# Parse new contract version
metadata_v2 = parse_contract_file("data/examples/book/book_contract.yaml")

# Store as new version
schema_id_v2 = store.store_schema(
    schema_name="user",
    schema=metadata_v2.schema,
    version="2.0.0"  # New version
)

# Applications can query by version
schema_v1 = store.get_schema_by_name_and_version("user", "1.0.0")
schema_v2 = store.get_schema_by_name_and_version("user", "2.0.0")

Scenario 2: Multiple Applications

Flow:

One contract → Stored once in database → 
Multiple apps retrieve schema → Each generates its own model → 
All validate against same contract

Code:

# Application A (ETL Pipeline)
schema = store.get_schema(schema_id)
UserModel = from_dict(schema, "User")
# ... validate ETL data

# Application B (API Service)
schema = store.get_schema(schema_id)  # Same schema
UserModel = from_dict(schema, "User")
# ... validate API requests

# Application C (Data Quality Tool)
schema = store.get_schema(schema_id)  # Same schema
UserModel = from_dict(schema, "User")
# ... run data quality checks

Scenario 3: Schema Evolution

Flow:

Old schema v1.0 → New schema v1.1 → Both stored in database → 
Apps can query by version → Gradual migration possible → 
Old apps use v1.0, new apps use v1.1

Code:

# Store multiple versions
schema_id_v1 = store.store_schema("user", schema_v1, version="1.0.0")
schema_id_v2 = store.store_schema("user", schema_v2, version="1.1.0")

# Legacy application uses old version
legacy_schema = store.get_schema(schema_id_v1)
LegacyUserModel = from_dict(legacy_schema, "User")

# New application uses new version
new_schema = store.get_schema(schema_id_v2)
NewUserModel = from_dict(new_schema, "User")
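When several versions coexist, "latest" has to be resolved numerically, because plain string comparison ranks "1.2.0" above "1.10.0". The sketch below shows one way to do that for simple MAJOR.MINOR.PATCH strings; `version_key` and `latest_version` are hypothetical helpers, pre-release tags such as "-rc1" are not handled, and how PyCharter itself picks the latest version is not specified here.

```python
from typing import Iterable, Tuple


def version_key(version: str) -> Tuple[int, ...]:
    """Turn "1.10.0" into (1, 10, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def latest_version(versions: Iterable[str]) -> str:
    """Return the highest version by numeric comparison."""
    return max(versions, key=version_key)
```

For the versions "1.0.0", "1.10.0", and "1.2.0", `latest_version` returns "1.10.0", whereas the built-in `max` on the raw strings would return "1.2.0".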


Best Practices

1. Contract Design

  • Keep contracts versioned and documented
  • Separate business concerns from technical concerns
  • Use descriptive field names and descriptions
  • Include examples in metadata

2. Storage Strategy

  • Choose the right metadata store for your use case:
      • PostgreSQL: ACID transactions, complex queries, production systems
      • MongoDB: Flexible schema, document storage, rapid development
      • Redis: High-throughput, caching, ephemeral data
      • InMemory: Testing, development, prototyping
  • Store all contract components (schema, metadata with ownership/governance, coercion rules, validation rules, ontology when using wiki-enabled store)
  • Merge ownership and governance into metadata before storing (they are part of metadata, not separate)
  • Use proper versioning (semantic versioning recommended)
  • For PostgreSQL: Initialize schema with pycharter db init and use migrations
  • Implement proper indexing for fast retrieval
  • Consider retention policies for old versions

3. Model Generation

  • Generate models on-demand rather than at startup
  • Cache generated models when appropriate
  • Handle schema changes gracefully
  • Log model generation for debugging
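Caching generated models can be as simple as memoizing the builder on the schema identifier. The sketch below uses `functools.lru_cache`; the `_SCHEMAS` dictionary stands in for a metadata store and the plain class stands in for a generated Pydantic model, so every name here is hypothetical.

```python
from functools import lru_cache
from typing import Any, Dict

# Stand-in for the metadata store; in practice this lookup would hit a database.
_SCHEMAS: Dict[str, Dict[str, Any]] = {
    "user:1.0.0": {"type": "object", "properties": {"username": {"type": "string"}}},
}


@lru_cache(maxsize=128)
def get_model(schema_id: str) -> type:
    """Build a model class once per schema_id and reuse it afterwards."""
    schema = _SCHEMAS[schema_id]
    # A plain class holding the property names stands in for a generated model.
    return type("Model", (), {"fields": tuple(schema["properties"])})
```

Because the cache key includes the version, a new schema version produces a new model while older versions stay cached; `get_model.cache_clear()` drops everything if a stored schema is ever changed in place.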

4. Validation Strategy

  • Validate early in your pipeline
  • Use batch validation for large datasets
  • Log validation errors for monitoring
  • Send invalid data to dead letter queues
  • Provide clear error messages to users

5. Monitoring & Observability

  • Track validation success/failure rates
  • Monitor schema version usage
  • Alert on high validation failure rates
  • Track contract evolution over time
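Tracking failure rates and alerting on them needs little more than two counters and a threshold. The sketch below is a minimal in-process version; `ValidationStats` is a hypothetical class and the 5% default threshold is only an example value, not a PyCharter default.

```python
from dataclasses import dataclass


@dataclass
class ValidationStats:
    """Running counts of validation outcomes for one schema version."""
    valid: int = 0
    invalid: int = 0

    def record(self, is_valid: bool) -> None:
        """Count one validation outcome."""
        if is_valid:
            self.valid += 1
        else:
            self.invalid += 1

    @property
    def failure_rate(self) -> float:
        """Share of recorded validations that failed (0.0 when none recorded)."""
        total = self.valid + self.invalid
        return self.invalid / total if total else 0.0

    def should_alert(self, max_failure_rate: float = 0.05) -> bool:
        """True when the observed failure rate exceeds the threshold."""
        return self.failure_rate > max_failure_rate
```

In a real deployment these counts would typically be exported to a metrics system rather than held in memory, with one instance per schema version so that version usage is visible too.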

Additional Services

Contract Builder (pycharter.contract_builder)

Purpose: Reconstructs complete contracts from separate artifacts stored in metadata store.

When to Use: When you need to rebuild a complete contract from stored components (schema, metadata, rules) for runtime validation or contract export.

Example:

from pycharter import build_contract_from_store

# Rebuild complete contract from store
contract = build_contract_from_store(
    store=store,
    schema_title=schema_id,
    schema_version=None,  # Use latest version
    include_metadata=True,
    include_ownership=True,
    include_governance=True
)

# Contract contains: schema (raw), coercion_rules and validation_rules (separate), metadata, ownership, governance_rules, versions

REST API (api/)

Purpose: HTTP endpoints for all PyCharter services using FastAPI.

Location: Root-level api/ directory (separate from pycharter/ package)

When to Use: When you need to access PyCharter services via HTTP, integrate with other services, or provide a web interface.

Installation:

pip install "pycharter[api]"  # Quotes keep shells such as zsh from expanding the brackets

Running:

pycharter api
# Or: uvicorn api.main:app --reload

Documentation:

  • Swagger UI: http://localhost:8000/docs
  • ReDoc: http://localhost:8000/redoc

See api/README.md for complete API documentation.

Summary

The PyCharter data journey provides a complete solution for managing data contracts from specification to runtime validation and quality assurance:

  1. Contract Creation: Business and developers collaborate to define data contracts
  2. Contract Parsing: Decompose contracts into structured components
  3. Metadata Storage: Store components in metadata store (PostgreSQL, MongoDB, Redis, or InMemory) with versioning
  4. Contract Building: Reconstruct complete contracts from stored components
  5. Model Generation: Dynamically generate Pydantic models from schemas
  6. Runtime Validation: Validate data against contracts in production
  7. Quality Assurance: Monitor data quality, track violations, and profile data characteristics

This journey ensures:

  • ✅ Single source of truth (contract files)
  • ✅ Separation of concerns (business vs. technical)
  • ✅ Versioning and evolution tracking
  • ✅ Runtime flexibility and type safety
  • ✅ Early error detection and prevention
  • ✅ Consistency across multiple applications
  • ✅ Multiple storage backends (PostgreSQL, MongoDB, Redis, InMemory)
  • ✅ REST API for service integration
  • ✅ Data quality monitoring and violation tracking
  • ✅ Historical quality metrics and trend analysis
  • ✅ Data profiling for understanding data characteristics

By following this journey, you can maintain data quality, enforce contracts, and ensure consistency across your entire data infrastructure.


Separated Workflow: Schema, Metadata, and Rules Stored Separately

The separated workflow is an improved approach where schemas, metadata, and coercion/validation rules are stored and managed separately, enabling better collaboration between business units and developers.

Overview

The separated workflow addresses the need for:

  1. Business units to define metadata (ownership, governance, versioning) independently
  2. Developers to define schemas (as Pydantic models) independently
  3. Both to collaborate on coercion and validation rules
  4. Runtime to retrieve and combine all components automatically

The Separated Workflow

┌─────────────────────────────────────────────────────────────┐
│ Step 1: Business Unit Provides Metadata                     │
│ - Ownership information                                      │
│ - Governance rules                                           │
│ - Version information                                        │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ Step 2: Developer Writes Pydantic Model                    │
│ - Define data types and structure                           │
│ - Convert to JSON Schema                                    │
│ - Store schema separately                                   │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ Step 3: Developer + Business Define Rules                  │
│ - Coercion rules (data transformation)                      │
│ - Validation rules (business + technical checks)            │
│ - Store rules separately                                    │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ Step 4: Store All Components Separately                     │
│ - Schema stored in database                                 │
│ - Metadata stored in database                               │
│ - Coercion rules stored in database                         │
│ - Validation rules stored in database                       │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ Step 5: Runtime Validation                                  │
│ - Retrieve schema from store                                │
│ - Retrieve coercion rules from store                        │
│ - Retrieve validation rules from store                      │
│ - Merge all components                                      │
│ - Generate Pydantic model                                   │
│ - Validate data                                             │
└─────────────────────────────────────────────────────────────┘

Step-by-Step: Separated Workflow

Step 1: Business Unit Provides Metadata

business_metadata = {
    "owner": "data-team",
    "team": "engineering",
    "contact": "data-team@example.com",
    "description": "User data contract for authentication",
    "governance_rules": {
        "data_retention": {"days": 365},
        "pii_fields": {"fields": ["email", "user_id"]},
    },
    "version": "1.0.0",
}

Step 2: Developer Writes Pydantic Model and Converts to JSON Schema

from pydantic import BaseModel, Field
from pycharter import to_dict

# Developer writes Pydantic model
class User(BaseModel):
    user_id: str = Field(..., description="Unique identifier")
    username: str = Field(..., min_length=3, max_length=20)
    email: str = Field(..., description="User's email address")
    age: int = Field(..., ge=0, le=150)

# Convert to JSON Schema
schema = to_dict(User)
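
To make the conversion less opaque, the sketch below shows how the `Field` constraints used above correspond to standard JSON Schema keywords (`min_length` to `minLength`, `ge` to `minimum`, and so on). It is not PyCharter's converter: `field_to_property` is a hypothetical helper, and the real `to_dict` output may carry additional keys such as titles or a `required` list.

```python
# Correspondence between Pydantic Field constraints and JSON Schema keywords.
CONSTRAINT_TO_KEYWORD = {
    "min_length": "minLength",
    "max_length": "maxLength",
    "ge": "minimum",
    "le": "maximum",
    "gt": "exclusiveMinimum",
    "lt": "exclusiveMaximum",
    "pattern": "pattern",
    "description": "description",
}

PYTHON_TO_JSON_TYPE = {str: "string", int: "integer", float: "number", bool: "boolean"}


def field_to_property(py_type, **constraints):
    """Hypothetical helper: build one JSON Schema property from a type and constraints."""
    prop = {"type": PYTHON_TO_JSON_TYPE[py_type]}
    for name, value in constraints.items():
        prop[CONSTRAINT_TO_KEYWORD[name]] = value
    return prop


# The same four fields as the User model above.
sketched_properties = {
    "user_id": field_to_property(str, description="Unique identifier"),
    "username": field_to_property(str, min_length=3, max_length=20),
    "email": field_to_property(str, description="User's email address"),
    "age": field_to_property(int, ge=0, le=150),
}
```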

Step 3: Developer + Business Define Coercion and Validation Rules

# Coercion rules (data transformation before validation)
coercion_rules = {
    "user_id": "coerce_to_string",
    "age": "coerce_to_integer",
}

# Validation rules (additional checks applied after schema/type validation)
validation_rules = {
    "username": {
        "no_capital_characters": None,  # Business requirement
        "min_length": {"threshold": 3},  # Developer constraint
    },
    "age": {
        "is_positive": {"threshold": 0},  # Business requirement
    },
}
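
The ordering matters here: coercion runs first, and the extra validation rules then see the coerced values. The following sketch shows that ordering with stand-in implementations. The rule names come from the dictionaries above, but the functions behind them (`COERCIONS`, `CHECKS`, `apply_rules`) are assumptions made for illustration; PyCharter's built-in rules may behave differently.

```python
from typing import Any, Callable, Dict, List, Tuple

coercion_rules = {"user_id": "coerce_to_string", "age": "coerce_to_integer"}
validation_rules = {
    "username": {"no_capital_characters": None, "min_length": {"threshold": 3}},
    "age": {"is_positive": {"threshold": 0}},
}

# Hypothetical implementations keyed by rule name.
COERCIONS: Dict[str, Callable[[Any], Any]] = {
    "coerce_to_string": str,
    "coerce_to_integer": int,
}
CHECKS: Dict[str, Callable[[Any, dict], bool]] = {
    "no_capital_characters": lambda value, params: not any(c.isupper() for c in value),
    "min_length": lambda value, params: len(value) >= params["threshold"],
    "is_positive": lambda value, params: value > params["threshold"],
}


def apply_rules(record: dict) -> Tuple[dict, List[str]]:
    """Coerce first, then run the extra checks on the coerced values."""
    coerced = dict(record)
    for name, rule in coercion_rules.items():
        if name in coerced:
            coerced[name] = COERCIONS[rule](coerced[name])
    errors: List[str] = []
    for name, rules in validation_rules.items():
        if name not in coerced:
            continue
        for rule_name, params in rules.items():
            # A rule declared with None takes no parameters.
            if not CHECKS[rule_name](coerced[name], params or {}):
                errors.append(f"{name}: failed {rule_name}")
    return coerced, errors


coerced, errors = apply_rules({"user_id": 42, "username": "Alice", "age": "30"})
```

In this run `user_id` becomes the string `"42"`, `age` becomes the integer `30`, and the only reported failure is the capital letter in `username`.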

Step 4: Store All Components Separately

from pycharter import InMemoryMetadataStore, PostgresMetadataStore, MongoDBMetadataStore

# Choose your store implementation
store = InMemoryMetadataStore()  # or PostgresMetadataStore, MongoDBMetadataStore, etc.
store.connect()

# Store schema (from developer)
schema_id = store.store_schema("user", schema, version="1.0.0")

# Merge ownership and governance into metadata before storing
# Ownership and governance are part of metadata, not separate entities
metadata_dict = dict(business_metadata) if business_metadata else {}

# Expose the owner from Step 1 as a list of business owners
owner = metadata_dict.get("owner")
if owner:
    metadata_dict["business_owners"] = [owner]
# "governance_rules" is already a key of business_metadata, so it is carried over as-is

# Store metadata once with all information (ownership and governance included)
if metadata_dict:
    store.store_metadata(
        resource_id=schema_id,
        resource_type="schema",
        metadata=metadata_dict,
    )

# Store coercion rules (from developer + business)
if coercion_rules:
    store.store_coercion_rules(schema_id, coercion_rules, version="1.0.0")

# Store validation rules (from developer + business)
if validation_rules:
    store.store_validation_rules(schema_id, validation_rules, version="1.0.0")

Step 5: Runtime Validation

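from pycharter import validate, validate_with_store, get_model_from_store

# Example payload for illustration; in production this arrives from an upstream system
incoming_data = {
    "user_id": "u-123",
    "username": "alice",
    "email": "alice@example.com",
    "age": 30,
}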

# Option 1: Use convenience function (retrieves all and validates)
result = validate_with_store(
    store=store,
    schema_id=schema_id,
    data=incoming_data,
    strict=False,
)

# Option 2: Get model once and reuse
UserModel = get_model_from_store(store, schema_id, "User")
result = validate(UserModel, incoming_data)

Benefits of Separated Workflow

  1. Clear Separation of Concerns: Business owns metadata, developer owns schema, both collaborate on rules
  2. Independent Versioning: Each component can be versioned independently
  3. Independent Updates: Update components without affecting others
  4. Better Collaboration: Business and developer can work in parallel
  5. Runtime Flexibility: Retrieve and combine components on-demand
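
Independent versioning can be pictured as each component type keeping its own version history per schema. The sketch below assumes that passing `version=None` means "use the latest version", which this document does not state explicitly; `VersionedComponents` and `version_key` are hypothetical names.

```python
from typing import Dict, Optional, Tuple


def version_key(version: str) -> Tuple[int, ...]:
    """Order versions numerically, so "1.10.0" sorts after "1.9.0"."""
    return tuple(int(part) for part in version.split("."))


class VersionedComponents:
    """Hypothetical store: every (kind, schema_id) pair has its own version history."""

    def __init__(self) -> None:
        self._history: Dict[Tuple[str, str], Dict[str, dict]] = {}

    def put(self, kind: str, schema_id: str, version: str, payload: dict) -> None:
        self._history.setdefault((kind, schema_id), {})[version] = payload

    def get(self, kind: str, schema_id: str, version: Optional[str] = None) -> dict:
        versions = self._history[(kind, schema_id)]
        if version is None:  # assumption: None resolves to the latest version
            version = max(versions, key=version_key)
        return versions[version]


components = VersionedComponents()
components.put("schema", "user", "1.0.0", {"type": "object"})
components.put("validation_rules", "user", "1.0.0", {"age": {"is_positive": {"threshold": 0}}})
# The rules move to 1.1.0 while the schema stays at 1.0.0.
components.put("validation_rules", "user", "1.1.0", {"age": {"is_positive": {"threshold": 17}}})

latest_rules = components.get("validation_rules", "user")
current_schema = components.get("schema", "user")
pinned_rules = components.get("validation_rules", "user", version="1.0.0")
```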

API Reference: Separated Workflow

Store Components Separately

# Store schema (in schemas table)
schema_id = store.store_schema(schema_name, schema, version=None)

# Store metadata (in metadata_records table)
metadata_record_id = store.store_metadata(resource_id, metadata, resource_type="schema")

# Store coercion rules (in coercion_rules table)
coercion_rules_id = store.store_coercion_rules(schema_id, coercion_rules, version=None)

# Store validation rules (in validation_rules table)
validation_rules_id = store.store_validation_rules(schema_id, validation_rules, version=None)

Retrieve Components

# Get schema only (from schemas table)
schema = store.get_schema(schema_id)

# Get coercion rules only (from coercion_rules table)
coercion_rules = store.get_coercion_rules(schema_id, version=None)

# Get validation rules only (from validation_rules table)
validation_rules = store.get_validation_rules(schema_id, version=None)

# Get metadata record (from metadata_records table)
metadata_record = store.get_metadata(resource_id, resource_type="schema")

# Get complete schema (with rules merged - for display/docs, not for editing)
# Note: For validation, prefer the Validator class, which handles merging internally
complete_schema = store.get_complete_schema(schema_id, version=None)

# Ontology (wiki-enabled stores only: PostgresMetadataStore, InMemoryMetadataStore)
# store.store_ontology(contract_name, contract_version, ontology_dict)
# ontology = store.get_ontology(contract_name, contract_version)
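
What "rules merged" might look like can be sketched as attaching each field's rules to its entry under `properties`. The key names `coercion` and `validations` and the helper `merge_rules_into_schema` are assumptions for illustration; the structure that `get_complete_schema` actually returns should be checked against a real store.

```python
import copy


def merge_rules_into_schema(schema: dict, coercion_rules: dict, validation_rules: dict) -> dict:
    """Hypothetical merge: return a new schema with per-field rules attached."""
    merged = copy.deepcopy(schema)  # leave the stored schema untouched
    properties = merged.setdefault("properties", {})
    for name, rule in coercion_rules.items():
        if name in properties:
            properties[name]["coercion"] = rule
    for name, rules in validation_rules.items():
        if name in properties:
            properties[name]["validations"] = copy.deepcopy(rules)
    return merged


stored_schema = {
    "type": "object",
    "properties": {"username": {"type": "string"}, "age": {"type": "integer"}},
}
merged_schema = merge_rules_into_schema(
    stored_schema,
    coercion_rules={"age": "coerce_to_integer"},
    validation_rules={"username": {"min_length": {"threshold": 3}}},
)
```

Working on a deep copy keeps the merged view read-only in spirit: edits still go to the individual components, which matches the "for display/docs, not for editing" note above.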

Runtime Validation Functions

# Validate with store (retrieves all components automatically)
result = validate_with_store(store, schema_id, data, version=None, strict=False)

# Validate batch with store
results = validate_batch_with_store(store, schema_id, data_list, version=None, strict=False)

# Get model from store (for multiple validations)
Model = get_model_from_store(store, schema_id, model_name=None, version=None)
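
The `strict` flag and the batch variant can be illustrated together. The sketch assumes that `strict=True` raises on the first invalid record while `strict=False` returns one result per record; that reading, along with the names `ItemResult`, `validate_batch_sketch`, and `check_age`, is hypothetical and not taken from PyCharter.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ItemResult:
    index: int
    is_valid: bool
    errors: List[str] = field(default_factory=list)


def validate_batch_sketch(
    records: List[dict],
    check: Callable[[dict], List[str]],
    strict: bool = False,
) -> List[ItemResult]:
    """Validate records in order; in strict mode, stop at the first failure."""
    results = []
    for index, record in enumerate(records):
        errors = check(record)
        if errors and strict:
            raise ValueError(f"record {index} invalid: {errors}")
        results.append(ItemResult(index, not errors, errors))
    return results


def check_age(record: dict) -> List[str]:
    """Stand-in check: age must be an integer between 0 and 150."""
    age = record.get("age")
    if not isinstance(age, int) or not 0 <= age <= 150:
        return ["age must be an integer between 0 and 150"]
    return []


batch = [{"age": 30}, {"age": -1}]
batch_results = validate_batch_sketch(batch, check_age)
invalid_indexes = [item.index for item in batch_results if not item.is_valid]
```

Non-strict mode suits pipelines that want to quarantine bad records and continue; strict mode suits request handlers that should reject the whole payload.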

Comparison: Combined vs Separated Workflow

| Aspect | Combined Workflow | Separated Workflow |
|---|---|---|
| Contract File | Single file with all components | Components stored separately |
| Business Input | Part of contract file | Separate metadata |
| Developer Input | Part of contract file | Separate schema |
| Rules | Embedded in schema | Stored separately |
| Versioning | Single version for all | Independent versions |
| Updates | Update entire contract | Update components independently |
| Collaboration | Requires coordination | Parallel work possible |
| Runtime | Parse contract → generate model | Retrieve → merge → generate model |

When to Use Separated Workflow

Use the separated workflow when:

  - Business and developer teams work independently
  - You need independent versioning of components
  - Schemas change frequently but metadata is stable (or vice versa)
  - You want maximum flexibility in runtime validation
  - You're building a large-scale data contract management system

See the Example Notebooks for workflow examples.