
Core Concepts

Understanding PyCharter's key concepts will help you use it effectively.

Data Contracts

A data contract is a formal specification that defines:

  • Schema: The structure and types of your data (JSON Schema)
  • Coercion Rules: Pre-validation transformations (e.g., string → integer)
  • Validation Rules: Business constraints checked after schema validation
  • Metadata: Description, ownership, governance information
  • Ontology (optional): Semantic field annotations—what each field means in the business domain (concepts, definitions, relationships)

Contract Structure

# user_contract.yaml
schema:
  type: object
  version: "1.0.0"
  properties:
    name:
      type: string
      minLength: 1
    email:
      type: string
      format: email
    age:
      type: integer
      coercion: coerce_to_integer  # Pre-validation coercion
      validations:
        is_positive: {}            # Post-validation check
  required:
    - name
    - email

metadata:
  title: User Contract
  description: Defines the structure of user records
  version: "1.0.0"

ownership:
  owner: data-team
  steward: alice@example.com

# Optional: ontology (semantic field annotations; see Wiki / Concepts)
# ontology:
#   version: "1.0.0"
#   fields:
#     email: { concept: user_email, definition: "Primary email address" }

When you use build_contract() or load a contract from the store, the contract dictionary contains a raw schema plus coercion_rules and validation_rules as separate keys (rules are not merged into the schema). The Validator merges rules internally when validating.
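To make the separation concrete, here is a minimal sketch of what such a contract dictionary might look like and how rules could be folded back into a schema. The per-field layout of coercion_rules and validation_rules and the merge_rules helper are assumptions for illustration, not PyCharter's actual internals.

```python
import copy

# Hypothetical contract dictionary: a raw schema plus rules under separate keys.
contract = {
    "schema": {
        "type": "object",
        "properties": {"age": {"type": "integer"}},
    },
    "coercion_rules": {"age": "coerce_to_integer"},    # assumed layout: field -> rule name
    "validation_rules": {"age": {"is_positive": {}}},  # assumed layout: field -> {rule: config}
}


def merge_rules(contract: dict) -> dict:
    """Return a copy of the schema with per-field rules attached (illustrative only)."""
    merged = copy.deepcopy(contract["schema"])
    props = merged.setdefault("properties", {})
    for field, rule in contract.get("coercion_rules", {}).items():
        props.setdefault(field, {})["coercion"] = rule
    for field, rules in contract.get("validation_rules", {}).items():
        props.setdefault(field, {})["validations"] = rules
    return merged


merged_schema = merge_rules(contract)
```

The deep copy keeps the stored schema raw, which matches the statement above that rules are not merged into the schema itself.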

Contract Lifecycle

graph LR
    A[Define Contract] --> B[Parse]
    B --> C[Store in Registry]
    C --> D[Generate Validators]
    D --> E[Validate Data]
    E --> F[Monitor Quality]
    F --> A

Validation Pipeline

When data is validated, it goes through three stages:

1. Coercion (Pre-validation)

Transforms input data before validation:

# Input: {"age": "25"}
# After coercion: {"age": 25}  (string → integer)

Built-in coercions:

| Coercion | Description |
|---|---|
| coerce_to_string | Convert to string |
| coerce_to_integer | Convert to integer |
| coerce_to_float | Convert to float |
| coerce_to_boolean | Convert to boolean |
| coerce_to_datetime | Parse ISO datetime |
| coerce_to_date | Parse date only |
| coerce_to_lowercase | Lowercase string |
| coerce_to_uppercase | Uppercase string |
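The coercion stage can be pictured as a lookup from rule name to a small function applied per field. The sketch below reuses three rule names from the list above, but the function bodies, the COERCIONS registry, and apply_coercions are stand-ins written for this example. In particular, leaving an unconvertible value unchanged (so that schema validation reports it) is an assumption about behaviour, not something the source states.

```python
from datetime import datetime


def coerce_to_integer(value):
    """Convert strings such as "25" to int; leave anything else untouched."""
    if isinstance(value, str):
        try:
            return int(value.strip())
        except ValueError:
            return value  # assumption: let schema validation report the bad value
    return value


# Hypothetical registry keyed by the rule names used in a contract.
COERCIONS = {
    "coerce_to_integer": coerce_to_integer,
    "coerce_to_lowercase": lambda v: v.lower() if isinstance(v, str) else v,
    "coerce_to_datetime": lambda v: datetime.fromisoformat(v) if isinstance(v, str) else v,
}


def apply_coercions(record: dict, rules: dict) -> dict:
    """Return a new record with each configured field passed through its coercion."""
    out = dict(record)
    for field, rule_name in rules.items():
        if field in out:
            out[field] = COERCIONS[rule_name](out[field])
    return out


coerced = apply_coercions(
    {"age": "25", "email": "Alice@Example.COM"},
    {"age": "coerce_to_integer", "email": "coerce_to_lowercase"},
)
```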

2. Schema Validation

Validates against JSON Schema (Draft 2020-12):

# Checks: type, required, minLength, pattern, enum, etc.
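For intuition only, the following hand-rolled checker covers three of the keywords mentioned (type, required, minLength). It is a deliberately tiny subset written for this page and is not a JSON Schema implementation; check_record and TYPE_MAP are hypothetical names, and a real validator handles the full draft.

```python
TYPE_MAP = {"string": str, "integer": int, "object": dict}


def check_record(record: dict, schema: dict) -> list:
    """Return human-readable errors for a small subset of JSON Schema keywords."""
    errors = []
    for name in schema.get("required", []):
        if name not in record:
            errors.append(f"{name}: required field is missing")
    for name, spec in schema.get("properties", {}).items():
        if name not in record:
            continue
        value = record[name]
        expected = TYPE_MAP.get(spec.get("type"))
        # bool is a subclass of int in Python, so reject it explicitly for "integer".
        wrong_type = expected is not None and (
            not isinstance(value, expected)
            or (expected is int and isinstance(value, bool))
        )
        if wrong_type:
            errors.append(f"{name}: expected {spec['type']}")
            continue
        if "minLength" in spec and isinstance(value, str) and len(value) < spec["minLength"]:
            errors.append(f"{name}: shorter than minLength {spec['minLength']}")
    return errors


user_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string", "minLength": 1},
        "email": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["name", "email"],
}

schema_errors = check_record({"name": "", "age": "25"}, user_schema)
```

Note that the uncoerced string "25" fails the integer check here, which is why coercion runs before this stage.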

3. Custom Validation (Post-validation)

Applies business rules after schema validation:

# Checks: is_positive, is_email, matches_regex, etc.

Built-in validations:

| Validation | Description | Config |
|---|---|---|
| min_length | Minimum string/array length | {"threshold": N} |
| max_length | Maximum string/array length | {"threshold": N} |
| is_positive | Value > 0 | {} |
| is_email | Valid email format | {} |
| matches_regex | Match pattern | {"pattern": "..."} |
| only_allow | Whitelist values | {"allowed_values": [...]} |
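As a rough sketch of how rule names and their config dictionaries could drive post-validation checks, each rule below is a function of (value, config). The VALIDATIONS registry and run_validations are hypothetical. Whether matches_regex anchors the whole string (as re.fullmatch does here) is also an assumption.

```python
import re

# Hypothetical registry: rule name -> predicate taking (value, config).
VALIDATIONS = {
    "min_length": lambda v, cfg: len(v) >= cfg["threshold"],
    "max_length": lambda v, cfg: len(v) <= cfg["threshold"],
    "is_positive": lambda v, cfg: v > 0,
    "matches_regex": lambda v, cfg: re.fullmatch(cfg["pattern"], v) is not None,
    "only_allow": lambda v, cfg: v in cfg["allowed_values"],
}


def run_validations(value, rules: dict) -> list:
    """Return the names of the rules that the value fails, in contract order."""
    return [name for name, cfg in rules.items() if not VALIDATIONS[name](value, cfg)]


failed = run_validations(
    "ab",
    {"min_length": {"threshold": 3}, "only_allow": {"allowed_values": ["ab", "cd"]}},
)
```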

ETL Pipeline Architecture

PyCharter's ETL system follows the Extract-Transform-Load pattern:

graph TB
    subgraph Extract
        E1[HTTPExtractor]
        E2[FileExtractor]
        E3[DatabaseExtractor]
        E4[CloudStorageExtractor]
    end

    subgraph Transform
        T1[Rename]
        T2[Filter]
        T3[AddField]
        T4[Convert]
        T5[CustomFunction]
    end

    subgraph Load
        L1[PostgresLoader]
        L2[FileLoader]
        L3[CloudStorageLoader]
    end

    E1 --> T1
    E2 --> T1
    E3 --> T1
    E4 --> T1

    T1 --> T2 --> T3 --> T4 --> T5

    T5 --> L1
    T5 --> L2
    T5 --> L3

Pipeline Composition

Pipelines are built using the | (pipe) operator:

pipeline = (
    Pipeline(extractor)
    | transformer1
    | transformer2
    | loader
)
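In Python, this kind of chaining is typically implemented with the `__or__` special method. The toy class below illustrates the pattern only; SketchPipeline and the step functions are made up for this example and run synchronously, unlike PyCharter's Pipeline.

```python
class SketchPipeline:
    """Toy pipeline: `pipeline | step` returns a new pipeline with the step appended."""

    def __init__(self, extract, steps=()):
        self.extract = extract
        self.steps = tuple(steps)

    def __or__(self, step):
        # Build a new object so a partially composed pipeline can be reused safely.
        return SketchPipeline(self.extract, self.steps + (step,))

    def run(self):
        records = self.extract()
        for step in self.steps:
            records = step(records)
        return records


sink = []  # stands in for a load target


def extract():
    return [{"n": 1}, {"n": -2}, {"n": 3}]


def keep_positive(records):
    return [r for r in records if r["n"] > 0]


def double(records):
    return [{"n": r["n"] * 2} for r in records]


def load(records):
    sink.extend(records)
    return records


base = SketchPipeline(extract) | keep_positive
pipeline = base | double | load
result = pipeline.run()
```

Because each `|` returns a new object, `base` still holds a single step after `pipeline` is built from it.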

Async Execution

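Pipeline operations are asynchronous: pipeline.run() returns a coroutine, which you pass to asyncio.run() from synchronous code or await from inside async code: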

import asyncio

# From a script
result = asyncio.run(pipeline.run())

# From async code
result = await pipeline.run()
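One practical consequence of an awaitable run() is that several I/O-bound runs can overlap. The sketch below uses a stand-in coroutine, fake_pipeline_run, in place of a real pipeline, so it shows only the asyncio mechanics and makes no claim about PyCharter's actual throughput.

```python
import asyncio


async def fake_pipeline_run(name: str, delay: float) -> str:
    """Stand-in for an awaitable pipeline.run(); the sleep simulates I/O."""
    await asyncio.sleep(delay)
    return f"{name}: done"


async def main() -> list:
    # gather() runs both coroutines concurrently and returns results in argument order.
    return await asyncio.gather(
        fake_pipeline_run("users", 0.02),
        fake_pipeline_run("orders", 0.01),
    )


results = asyncio.run(main())
```

asyncio.run() starts a new event loop, so it cannot be called from code that is already running inside one; in that situation, await the coroutine directly.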

Metadata Store

The metadata store is a centralized registry for:

  • Schemas and their versions
  • Coercion rules
  • Validation rules
  • Metadata (ownership, description, governance)
  • Ontology (when using a wiki-enabled store)
  • Quality metrics history

Store Backends

| Backend | Use Case |
|---|---|
| InMemoryMetadataStore | Testing, development |
| SQLiteMetadataStore | Single-user, local development |
| PostgresMetadataStore | Production, multi-user |
| MongoDBMetadataStore | Document-oriented workloads |
| RedisMetadataStore | High-performance caching |

Schema Versioning

Schemas are versioned to track changes:

# Store a new version
store.store_schema("user", schema_v1, version="1.0.0")
store.store_schema("user", schema_v2, version="2.0.0")

# Get specific version
schema = store.get_schema("user", version="1.0.0")

# Get latest version
schema = store.get_schema("user")  # Returns 2.0.0
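The "latest version" lookup can be sketched with a small in-memory store. SketchSchemaStore is hypothetical, and resolving "latest" by comparing version components numerically is an assumption; a real store might instead use insertion order or timestamps.

```python
class SketchSchemaStore:
    """Toy in-memory registry keyed by schema name and version string."""

    def __init__(self):
        self._schemas = {}  # name -> {version string -> schema}

    def store_schema(self, name, schema, version):
        self._schemas.setdefault(name, {})[version] = schema

    def get_schema(self, name, version=None):
        versions = self._schemas[name]
        if version is None:
            # Compare numerically so that "10.0.0" sorts after "2.0.0".
            version = max(versions, key=lambda v: tuple(int(part) for part in v.split(".")))
        return versions[version]


sketch_store = SketchSchemaStore()
sketch_store.store_schema("user", {"title": "v1"}, version="1.0.0")
sketch_store.store_schema("user", {"title": "v10"}, version="10.0.0")
sketch_store.store_schema("user", {"title": "v2"}, version="2.0.0")

latest = sketch_store.get_schema("user")
pinned = sketch_store.get_schema("user", version="1.0.0")
```

A plain string comparison would rank "2.0.0" above "10.0.0", which is why the sketch converts each component to an integer first.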

Quality Assurance

PyCharter supports two types of quality checks: contract (row-based) and pipeline (column/dataset-based). See Data quality: contract vs pipeline for an overview and when to use each.

Contract quality metrics (row-based)

| Metric | Description |
|---|---|
| overall_score | Overall quality (0-100) |
| violation_rate | % of records with errors |
| completeness | % of non-null required fields |
| accuracy | % of valid values |
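To show how two of these row-based metrics could be derived, the sketch below computes violation_rate and completeness as percentages over a list of records. The formulas are one plausible reading of the descriptions above, and contract_metrics is a hypothetical helper; PyCharter's exact definitions may differ.

```python
def contract_metrics(records, required, is_valid):
    """Compute two row-based metrics as percentages (assumed formulas)."""
    total = len(records)
    if total == 0 or not required:
        return {"violation_rate": 0.0, "completeness": 100.0}
    violations = sum(1 for record in records if not is_valid(record))
    filled = sum(1 for record in records for field in required if record.get(field) is not None)
    return {
        "violation_rate": 100.0 * violations / total,
        "completeness": 100.0 * filled / (total * len(required)),
    }


required_fields = ["name", "email"]
sample_records = [
    {"name": "a", "email": "a@example.com"},
    {"name": "b", "email": None},
    {"name": None, "email": None},
    {"name": "d", "email": "d@example.com"},
]


def all_required_present(record):
    return all(record.get(field) is not None for field in required_fields)


metrics = contract_metrics(sample_records, required_fields, all_required_present)
```

Here two of four records fail (50.0) and five of eight required cells are filled (62.5).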

Quality Thresholds

Set alerts when quality drops:

thresholds = QualityThresholds(
    min_overall_score=95.0,    # Alert if score < 95
    max_violation_rate=0.05,   # Alert if violations > 5%
)
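Conceptually, a threshold check compares each metric with its limit and reports the ones that are breached. The dataclass and breached helper below are illustrative stand-ins, not the QualityThresholds implementation. They treat violation_rate as a fraction, following the 0.05 = 5% comment above.

```python
from dataclasses import dataclass


@dataclass
class SketchThresholds:
    """Hypothetical stand-in mirroring the two parameters shown above."""
    min_overall_score: float = 95.0
    max_violation_rate: float = 0.05  # fraction of records (0.05 = 5%)


def breached(metrics: dict, thresholds: SketchThresholds) -> list:
    """Return the names of the metrics that cross their threshold."""
    alerts = []
    if metrics["overall_score"] < thresholds.min_overall_score:
        alerts.append("overall_score")
    if metrics["violation_rate"] > thresholds.max_violation_rate:
        alerts.append("violation_rate")
    return alerts


alerts = breached({"overall_score": 93.0, "violation_rate": 0.02}, SketchThresholds())
```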

Violation Tracking

Every validation error is tracked:

# Query violations
violations = store.query_violations(
    schema_id="user_schema",
    status="open",
    severity="high"
)

API Tiers

PyCharter's API is organized into three tiers:

Tier 1: Classes

Best performance, full features:

from pycharter import Validator, Pipeline, QualityCheck

validator = Validator.from_file("contract.yaml")

Tier 2: Convenience Functions

Quick start, one-off operations:

from pycharter import from_dict, validate, validate_with_contract

Model = from_dict(schema, "User")
result = validate(Model, data)

Tier 3: Low-Level Utilities

When you need fine-grained control:

from pycharter import validate_batch, model_to_schema

Error Handling

PyCharter uses a structured exception hierarchy:

from pycharter import Pipeline
from pycharter.shared.errors import (
    PyCharterError,         # Base exception
    ConfigError,            # Config loading/parsing
    ConfigValidationError,  # Schema validation
    ExpressionError,        # Expression evaluation
)

try:
    pipeline = Pipeline.from_config_dir("invalid/")
except ConfigError as e:
    print(f"Config error: {e}")
except PyCharterError as e:
    print(f"PyCharter error: {e}")
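The order of the except clauses matters: Python uses the first clause that matches, so the more specific exception has to come before the base class. The stand-in classes below mimic that relationship, assuming (as the "Base exception" comment suggests) that ConfigError derives from PyCharterError.

```python
class SketchPyCharterError(Exception):
    """Stand-in for the library's base exception."""


class SketchConfigError(SketchPyCharterError):
    """Stand-in for a more specific error derived from the base class."""


def classify(exc: Exception) -> str:
    """Show which except clause handles a given exception."""
    try:
        raise exc
    except SketchConfigError:
        return "config"      # the specific handler must come first
    except SketchPyCharterError:
        return "pycharter"   # catches every other library error


kinds = [classify(SketchConfigError("bad file")), classify(SketchPyCharterError("other"))]
```

If the two clauses were swapped, the base-class handler would catch every error and the specific branch would never run.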

Error Modes

Control how pipeline errors are handled:

from pycharter.shared.errors import ErrorMode, ErrorContext

# Strict: raise on first error (default)
result = await pipeline.run(error_context=ErrorContext(mode=ErrorMode.STRICT))

# Lenient: log and continue
result = await pipeline.run(error_context=ErrorContext(mode=ErrorMode.LENIENT))

# Collect: gather all errors
result = await pipeline.run(error_context=ErrorContext(mode=ErrorMode.COLLECT))
print(result.errors)  # List of all errors
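The three modes differ only in what happens when a record fails. The sketch below mimics those semantics with a plain loop; SketchErrorMode and process are hypothetical, and the real ErrorContext may carry more state than shown here.

```python
import logging
from enum import Enum


class SketchErrorMode(Enum):
    STRICT = "strict"
    LENIENT = "lenient"
    COLLECT = "collect"


def process(records, handler, mode):
    """Apply handler to each record, treating ValueError according to the mode."""
    results, errors = [], []
    for record in records:
        try:
            results.append(handler(record))
        except ValueError as exc:
            if mode is SketchErrorMode.STRICT:
                raise  # stop at the first failure
            if mode is SketchErrorMode.COLLECT:
                errors.append(exc)  # keep going, remember the error
            else:
                logging.warning("skipping %r: %s", record, exc)  # lenient: log and continue
    return results, errors


lenient_results, lenient_errors = process(["1", "x", "3"], int, SketchErrorMode.LENIENT)
collect_results, collect_errors = process(["1", "x", "3"], int, SketchErrorMode.COLLECT)
```

In this sketch, strict mode returns no partial results, because the exception propagates out of process on the first bad record.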

Next Steps