Data Contracts & Validation

Learn to define data contracts and validate data against them.

What You'll Learn

  • Define JSON Schema-based contracts
  • Add coercion rules for type conversion
  • Create custom validation rules
  • Use the Validator class effectively
  • Handle validation errors
  • Integrate validation into pipelines

Prerequisites

pip install pycharter

Part 1: Understanding Data Contracts

A data contract is a formal specification defining:

  1. Schema - Structure and types (JSON Schema)
  2. Coercion Rules - Pre-validation transformations
  3. Validation Rules - Post-validation business constraints
  4. Metadata - Description, ownership, versioning
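
The ordering implied by this list (coerce first, then check the schema, then apply business rules) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not pycharter's implementation: `run_contract` and its three rule tables are hypothetical names, and the sketch handles only flat records.

```python
# Hypothetical sketch of the coercion -> schema -> validation order.
# None of these names come from pycharter.

def run_contract(record, coercions, types, rules):
    """Return (cleaned_record, errors) for a flat record."""
    errors = []
    cleaned = dict(record)

    # 1. Coercion: transform raw values before any checks run.
    for field, coerce in coercions.items():
        if field in cleaned:
            try:
                cleaned[field] = coerce(cleaned[field])
            except (TypeError, ValueError) as exc:
                errors.append(f"{field}: coercion failed ({exc})")

    # 2. Schema: structural presence and type checks.
    for field, expected in types.items():
        if field not in cleaned:
            errors.append(f"{field}: missing")
        elif not isinstance(cleaned[field], expected):
            errors.append(f"{field}: expected {expected.__name__}")

    # 3. Validation rules: business constraints, applied only to sound structure.
    if not errors:
        for field, check in rules.items():
            if field in cleaned and not check(cleaned[field]):
                errors.append(f"{field}: rule failed")

    return cleaned, errors


cleaned, errors = run_contract(
    {"id": "7", "age": "30"},
    coercions={"id": int, "age": int},
    types={"id": int, "age": int},
    rules={"age": lambda v: v > 0},
)
print(cleaned, errors)
```

Running the business rules only after the structural checks pass keeps each rule free of defensive type handling.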

Contract File Structure

user_contract.yaml
schema:
  type: object
  version: "1.0.0"
  properties:
    id:
      type: integer
    name:
      type: string
      minLength: 1
    email:
      type: string
      format: email
    age:
      type: integer
      minimum: 0
      coercion: coerce_to_integer
      validations:
        is_positive: {}
  required:
    - id
    - name
    - email

metadata:
  title: User Contract
  description: Defines user record structure
  version: "1.0.0"

ownership:
  owner: data-team
  steward: alice@example.com

A contract can also be split across separate files:

schema.yaml
type: object
version: "1.0.0"
properties:
  id:
    type: integer
  name:
    type: string
    minLength: 1
  email:
    type: string
    format: email
  age:
    type: integer
    minimum: 0
required:
  - id
  - name
  - email

coercion_rules.yaml
version: "1.0.0"
rules:
  age: coerce_to_integer
  id: coerce_to_integer

validation_rules.yaml
version: "1.0.0"
rules:
  age:
    is_positive: {}
  email:
    is_email: {}
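
To see how the inline `coercion` and `validations` keys relate to the separate rule files, the sketch below pulls them out of a schema dictionary into two rule maps. It is an illustration only: `split_inline_rules` is a hypothetical helper, it looks at top-level properties only, and this tutorial does not state that pycharter combines the two layouts in this way.

```python
import copy


def split_inline_rules(schema):
    """Separate inline `coercion`/`validations` keys from a schema dict.

    Hypothetical helper, not part of pycharter. The input is not modified.
    """
    clean = copy.deepcopy(schema)
    coercion_rules, validation_rules = {}, {}
    for name, prop in clean.get("properties", {}).items():
        if "coercion" in prop:
            coercion_rules[name] = prop.pop("coercion")
        if "validations" in prop:
            validation_rules[name] = prop.pop("validations")
    return clean, coercion_rules, validation_rules


schema = {
    "type": "object",
    "properties": {
        "id": {"type": "integer"},
        "age": {
            "type": "integer",
            "minimum": 0,
            "coercion": "coerce_to_integer",
            "validations": {"is_positive": {}},
        },
    },
    "required": ["id"],
}

clean, coercions, validations = split_inline_rules(schema)
print(coercions)    # {'age': 'coerce_to_integer'}
print(validations)  # {'age': {'is_positive': {}}}
```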

Part 2: Schema Definition

Basic Types

properties:
  # String
  name:
    type: string
    minLength: 1
    maxLength: 100

  # Integer
  age:
    type: integer
    minimum: 0
    maximum: 150

  # Number (float)
  price:
    type: number
    minimum: 0
    exclusiveMaximum: 10000

  # Boolean
  active:
    type: boolean

  # Array
  tags:
    type: array
    items:
      type: string
    minItems: 1
    uniqueItems: true

  # Object
  address:
    type: object
    properties:
      street:
        type: string
      city:
        type: string
    required:
      - city

String Constraints

properties:
  username:
    type: string
    minLength: 3
    maxLength: 20
    pattern: "^[a-z0-9_]+$"  # lowercase, numbers, underscore only

  email:
    type: string
    format: email

  status:
    type: string
    enum: ["active", "inactive", "pending"]

  country_code:
    type: string
    const: "US"  # Fixed value
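
JSON Schema treats `pattern` as a search rather than a full match, which is why the username pattern carries explicit `^` and `$` anchors. The sketch below reproduces that behavior with Python's `re` module; `matches_username_pattern` is a hypothetical helper used only for illustration.

```python
import re

# Same expression as the `username` pattern in the schema.
USERNAME_PATTERN = re.compile(r"^[a-z0-9_]+$")


def matches_username_pattern(value: str) -> bool:
    # search() mirrors JSON Schema's unanchored matching, so the anchors
    # in the expression are what force the whole string to conform.
    return USERNAME_PATTERN.search(value) is not None


for candidate in ["alice_01", "Alice", "bob smith", ""]:
    print(repr(candidate), matches_username_pattern(candidate))
```

Without the anchors, a value such as "Alice" would pass because the substring "lice" satisfies the character class.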

Numeric Constraints

properties:
  age:
    type: integer
    minimum: 0
    maximum: 150

  price:
    type: number
    minimum: 0
    exclusiveMaximum: 10000
    multipleOf: 0.01  # Two decimal places

  quantity:
    type: integer
    minimum: 1
    default: 1
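
A note on `multipleOf: 0.01`: binary floating point cannot represent 0.01 exactly, so a naive remainder check on floats may leave a tiny nonzero residue. How a particular validator handles this varies by implementation. The sketch below shows one way to check divisibility exactly using `decimal`; `is_multiple_of` is a hypothetical helper, not a pycharter function.

```python
from decimal import Decimal


def is_multiple_of(value, step):
    """Check divisibility on decimal strings to avoid binary float rounding."""
    return Decimal(str(value)) % Decimal(str(step)) == 0


print(is_multiple_of(19.99, 0.01))   # True
print(is_multiple_of(19.995, 0.01))  # False
```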

Nested Objects

properties:
  user:
    type: object
    properties:
      profile:
        type: object
        properties:
          bio:
            type: string
          avatar_url:
            type: string
            format: uri
        required:
          - bio
      settings:
        type: object
        properties:
          notifications:
            type: boolean
            default: true
    required:
      - profile
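
Each `required` list applies only at its own nesting level: `profile` is required on `user`, and `bio` is required only once a `profile` object is present. The following sketch walks a schema dictionary and reports missing required paths. `missing_required` is a hypothetical helper written for this illustration, and it covers nested objects only (not arrays).

```python
def missing_required(schema, data, path=""):
    """Yield dotted paths of required fields that are absent from `data`."""
    for name in schema.get("required", []):
        if name not in data:
            yield f"{path}{name}"
    # Recurse into nested objects that are actually present in the data.
    for name, sub in schema.get("properties", {}).items():
        if sub.get("type") == "object" and isinstance(data.get(name), dict):
            yield from missing_required(sub, data[name], f"{path}{name}.")


user_schema = {
    "type": "object",
    "properties": {
        "profile": {
            "type": "object",
            "properties": {"bio": {"type": "string"}},
            "required": ["bio"],
        },
        "settings": {"type": "object", "properties": {}},
    },
    "required": ["profile"],
}

print(list(missing_required(user_schema, {"profile": {}})))   # ['profile.bio']
print(list(missing_required(user_schema, {"settings": {}})))  # ['profile']
```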

Arrays

properties:
  # Simple array
  tags:
    type: array
    items:
      type: string
    minItems: 1
    maxItems: 10
    uniqueItems: true

  # Array of objects
  addresses:
    type: array
    items:
      type: object
      properties:
        type:
          type: string
          enum: ["home", "work", "other"]
        street:
          type: string
        city:
          type: string
      required:
        - type
        - city
    minItems: 1

Part 3: Coercion Rules

Coercion rules transform incoming data before schema validation runs, so a value such as "42" can satisfy an integer field.

Built-in Coercions

Coercion                    Input → Output      Example
coerce_to_string            any → str           123 → "123"
coerce_to_integer           str/float → int     "42" → 42
coerce_to_float             str/int → float     "3.14" → 3.14
coerce_to_boolean           str/int → bool      "true" → True
coerce_to_datetime          str → datetime      "2024-01-01" → datetime
coerce_to_date              str → date          "2024-01-01" → date
coerce_to_lowercase         str → str           "HELLO" → "hello"
coerce_to_uppercase         str → str           "hello" → "HELLO"
coerce_to_stripped_string   str → str           " hi " → "hi"
coerce_to_list              any → list          "a" → ["a"]
coerce_empty_to_null        empty → None        "" → None
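
To make the table concrete, here are plain-Python stand-ins for three of the rows. They are sketches of the behavior the table describes, not the library's code: the function names, the accepted boolean spellings, and the treatment of edge cases are all assumptions.

```python
# Assumed spellings; the real coerce_to_boolean may accept a different set.
TRUE_STRINGS = {"true", "1", "yes", "y"}
FALSE_STRINGS = {"false", "0", "no", "n"}


def to_boolean(value):
    """Interpret common spellings; note that bool("false") would be True."""
    if isinstance(value, bool):
        return value
    if isinstance(value, int):
        return value != 0
    if isinstance(value, str):
        lowered = value.strip().lower()
        if lowered in TRUE_STRINGS:
            return True
        if lowered in FALSE_STRINGS:
            return False
    raise ValueError(f"cannot interpret {value!r} as a boolean")


def to_list(value):
    """Wrap a scalar in a list; leave existing lists untouched."""
    return value if isinstance(value, list) else [value]


def empty_to_null(value):
    """Map the empty string to None; pass everything else through."""
    return None if value == "" else value


print(to_boolean("true"), to_boolean("False"), to_boolean(0))
print(to_list("a"), to_list(["a", "b"]))
print(empty_to_null(""), empty_to_null("x"))
```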

Inline Coercion

Add coercion directly in the schema:

properties:
  age:
    type: integer
    coercion: coerce_to_integer

  email:
    type: string
    coercion: coerce_to_lowercase

  tags:
    type: array
    coercion: coerce_to_list

Separate Coercion Rules

coercion_rules.yaml
version: "1.0.0"
rules:
  age: coerce_to_integer
  price: coerce_to_float
  email: coerce_to_lowercase
  status: coerce_to_uppercase

Custom Coercion

Register custom coercion functions:

from pycharter.shared.coercions import register_coercion

def coerce_phone_number(value):
    """Remove all non-digit characters from phone number."""
    if isinstance(value, str):
        return ''.join(c for c in value if c.isdigit())
    return value

register_coercion("coerce_phone_number", coerce_phone_number)

Use in schema:

properties:
  phone:
    type: string
    coercion: coerce_phone_number
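
One detail worth knowing about the function above: `str.isdigit()` is also true for non-ASCII digit characters, so those would survive the filter. If only ASCII digits should remain, a variant along the following lines may be preferable. `coerce_phone_number_ascii` is a hypothetical name, and the function is shown standalone, without registration.

```python
import string


def coerce_phone_number_ascii(value):
    """Keep only the ASCII digits 0-9; pass non-strings through unchanged."""
    if isinstance(value, str):
        return "".join(c for c in value if c in string.digits)
    return value


print(coerce_phone_number_ascii("+1 (555) 123-4567"))  # 15551234567
print(coerce_phone_number_ascii(None))                 # None
```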

Part 4: Validation Rules

Validation rules run after schema validation and enforce business constraints that the schema alone does not express.

Built-in Validations

Validation              Description            Config
min_length              Minimum length         {"threshold": 3}
max_length              Maximum length         {"threshold": 100}
is_positive             Value > 0              {}
is_email                Valid email            {}
is_url                  Valid URL              {}
is_alphanumeric         Only letters/numbers   {}
is_numeric_string       Numeric string         {}
matches_regex           Match pattern          {"pattern": "..."}
only_allow              Whitelist              {"allowed_values": [...]}
no_capital_characters   No uppercase           {}
no_special_characters   No special chars       {}
non_empty_string        Not empty              {}
is_unique               Unique array items     {}

Inline Validation

Add validations directly in the schema:

properties:
  age:
    type: integer
    validations:
      is_positive: {}
      less_than_or_equal_to:
        threshold: 150

  username:
    type: string
    validations:
      min_length:
        threshold: 3
      max_length:
        threshold: 20
      matches_regex:
        pattern: "^[a-z0-9_]+$"

  status:
    type: string
    validations:
      only_allow:
        allowed_values: ["active", "inactive", "pending"]

Separate Validation Rules

validation_rules.yaml
version: "1.0.0"
rules:
  age:
    is_positive: {}
    less_than_or_equal_to:
      threshold: 150
  username:
    min_length:
      threshold: 3
    no_special_characters: {}
  email:
    is_email: {}

Custom Validation

Register custom validation functions:

from pycharter.shared.validations import register_validation

def is_valid_phone(min_digits=10):
    """Validate phone number has minimum digits."""
    def _validate(value, info):
        if value is None:
            return value
        digits = ''.join(c for c in str(value) if c.isdigit())
        if len(digits) < min_digits:
            raise ValueError(f"Phone number must have at least {min_digits} digits")
        return value
    return _validate

register_validation("is_valid_phone", is_valid_phone)

Use in schema:

properties:
  phone:
    type: string
    validations:
      is_valid_phone:
        min_digits: 10
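
The factory pattern used by `is_valid_phone` can be exercised on its own, which is handy for unit tests. The sketch below repeats the function from above and calls it directly. It assumes that the configuration keys in the schema (`min_digits: 10`) are passed to the factory as keyword arguments, and it passes `None` for `info` because this validator never reads it.

```python
def is_valid_phone(min_digits=10):
    """Validate phone number has minimum digits."""
    def _validate(value, info):
        if value is None:
            return value
        digits = ''.join(c for c in str(value) if c.isdigit())
        if len(digits) < min_digits:
            raise ValueError(f"Phone number must have at least {min_digits} digits")
        return value
    return _validate


# Build the validator once with its configuration, then apply it per value.
check = is_valid_phone(min_digits=10)

print(check("(555) 123-4567", None))  # returned unchanged: 10 digits

try:
    check("555-1234", None)  # only 7 digits
except ValueError as exc:
    print(f"Rejected: {exc}")
```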

Part 5: Using the Validator

Creating Validators

from pycharter import Validator

# From single contract file
validator = Validator.from_file("user_contract.yaml")

# From directory (schema.yaml, coercion_rules.yaml, validation_rules.yaml)
validator = Validator.from_dir("contracts/user/")

# From explicit files
validator = Validator.from_files(
    schema="schemas/user.yaml",
    coercion_rules="rules/coercion.yaml",
    validation_rules="rules/validation.yaml"
)

# From dictionaries
validator = Validator.from_dict(
    schema=schema_dict,
    coercion_rules=coercion_dict,
    validation_rules=validation_dict
)

# From metadata store
from pycharter import SQLiteMetadataStore

store = SQLiteMetadataStore("metadata.db")
store.connect()
validator = Validator(store=store, schema_id="user_schema_v1")

Single Record Validation

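# The comments below assume the contract defines the matching coercion rules.
result = validator.validate({
    "id": "123",      # Coerced to 123 when id uses coerce_to_integer
    "name": "Alice",
    "email": "ALICE@EXAMPLE.COM",  # Lowercased when email uses coerce_to_lowercase
    "age": "30"       # Coerced to 30 when age uses coerce_to_integer
})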

if result.is_valid:
    print(f"Valid data: {result.data}")
    # Access as Pydantic model
    print(f"Name: {result.data.name}")
    print(f"Email: {result.data.email}")
else:
    print(f"Validation errors: {result.errors}")

Batch Validation

records = [
    {"id": 1, "name": "Alice", "email": "alice@example.com", "age": 30},
    {"id": 2, "name": "", "email": "invalid", "age": -5},  # Invalid
    {"id": 3, "name": "Charlie", "email": "charlie@example.com", "age": 25},
]

results = validator.validate_batch(records)

valid_count = sum(1 for r in results if r.is_valid)
print(f"Valid: {valid_count}/{len(results)}")

# Process valid records and log the rest
# (process_record and log_errors are placeholders for your own functions)
for result in results:
    if result.is_valid:
        process_record(result.data)
    else:
        log_errors(result.errors)
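
When rejected records need to be kept for later inspection, it helps to pair each result with the record that produced it. The sketch below does this with a stand-in result class so that it runs without pycharter. `FakeResult` and `partition` are hypothetical names, and the pairing assumes that `validate_batch` returns results in input order, which this tutorial does not state explicitly.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class FakeResult:
    """Stand-in exposing the same three attributes used in this tutorial."""
    is_valid: bool
    data: Any = None
    errors: list = field(default_factory=list)


def partition(records, results):
    """Split a batch into validated data and rejected (record, errors) pairs."""
    valid, rejected = [], []
    for record, result in zip(records, results):
        if result.is_valid:
            valid.append(result.data)
        else:
            rejected.append({"record": record, "errors": result.errors})
    return valid, rejected


sample_records = [{"id": 1, "name": "Alice"}, {"id": 2, "name": ""}]
sample_results = [
    FakeResult(True, data={"id": 1, "name": "Alice"}),
    FakeResult(False, errors=[{"loc": ("name",), "msg": "too short"}]),
]

valid, rejected = partition(sample_records, sample_results)
print(f"Valid: {len(valid)}, rejected: {len(rejected)}")
```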

Strict Mode

Raise exceptions instead of returning errors:

from pydantic import ValidationError

try:
    result = validator.validate(data, strict=True)
    # If we get here, data is valid
    process(result.data)
except ValidationError as e:
    print(f"Validation failed: {e}")

Getting the Model

Access the generated Pydantic model:

# Get the model class
UserModel = validator.get_model()

# Use it directly
user = UserModel(id=1, name="Alice", email="alice@example.com", age=30)
print(user.model_dump())

# Export schema
print(UserModel.model_json_schema())

Part 6: Handling Errors

ValidationResult Structure

result = validator.validate(data)

# Check validity
if result.is_valid:
    # Access validated data (Pydantic model instance)
    validated = result.data
    print(validated.name)
else:
    # Access errors
    for error in result.errors:
        print(f"Field: {error['loc']}")
        print(f"Message: {error['msg']}")
        print(f"Type: {error['type']}")
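
For logs or API responses, the error dictionaries can be flattened into one line per problem. The sketch below assumes the Pydantic-style shape shown above (`loc` as a tuple of keys and indexes, plus `msg` and `type`); `format_errors` is a hypothetical helper and the sample errors are hand-written for illustration.

```python
def format_errors(errors):
    """Render each error as 'dotted.path: message [type]'."""
    lines = []
    for error in errors:
        # loc may mix field names and list indexes, e.g. ("addresses", 0, "city").
        location = ".".join(str(part) for part in error["loc"]) or "<root>"
        lines.append(f"{location}: {error['msg']} [{error['type']}]")
    return lines


sample_errors = [
    {"loc": ("name",), "msg": "String should have at least 1 character",
     "type": "string_too_short"},
    {"loc": ("addresses", 0, "city"), "msg": "Field required", "type": "missing"},
]

for line in format_errors(sample_errors):
    print(line)
```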

Error Types

Error Type                Description
string_too_short          String below minLength
string_too_long           String above maxLength
string_pattern_mismatch   Pattern not matched
missing                   Required field missing
int_parsing               Cannot parse as integer
value_error               Custom validation failed
enum                      Value not in enum

Custom Error Messages

from pycharter.shared.validations import register_validation

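def is_adult(min_age=18):
    """Validate that an age is at least min_age."""
    def _validate(value, info):
        # Skip missing values so that None does not raise a TypeError below
        if value is None:
            return value
        if value < min_age:
            # The ValueError message is what appears in the validation error
            raise ValueError(f"Must be at least {min_age} years old")
        return value
    return _validate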

register_validation("is_adult", is_adult)

Part 7: Integration Patterns

With ETL Pipelines

from pycharter import Pipeline, HTTPExtractor, FileLoader, CustomFunction, Validator

validator = Validator.from_file("contracts/user.yaml")

def validate_records(records):
    """Filter to only valid records."""
    valid = []
    for record in records:
        result = validator.validate(record)
        if result.is_valid:
            valid.append(result.data.model_dump())
    return valid

pipeline = (
    Pipeline(HTTPExtractor(url="https://api.example.com/users"))
    | CustomFunction(validate_records)
    | FileLoader(path="output/valid_users.json")
)

With FastAPI

from fastapi import FastAPI, HTTPException
from pycharter import Validator

app = FastAPI()
validator = Validator.from_file("contracts/user.yaml")

@app.post("/users")
async def create_user(data: dict):
    result = validator.validate(data)
    if not result.is_valid:
        raise HTTPException(status_code=422, detail=result.errors)

    # Process valid data (save_user is a placeholder for your persistence logic)
    return {"id": save_user(result.data)}

Decorator Pattern

from datetime import datetime

from pycharter import validate_input, validate_output

@validate_input("contracts/user_input.yaml")
@validate_output("contracts/user_output.yaml")
def process_user(data: dict) -> dict:
    # Input is already validated
    # Output will be validated before return
    return {
        "id": generate_id(),  # placeholder for your own ID generator
        "name": data["name"],
        "email": data["email"],
        "created_at": datetime.now().isoformat()
    }
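
For readers curious how such decorators can work, here is a minimal sketch of the wrapping technique using `functools.wraps`. It is not pycharter's implementation: `validate_with` and `require_keys` are hypothetical names, and the checks are simple key-presence tests rather than contract files.

```python
import functools


def validate_with(check_input, check_output):
    """Build a decorator that checks the argument and the return value."""
    def decorator(func):
        @functools.wraps(func)  # preserve the wrapped function's name and docstring
        def wrapper(data):
            check_input(data)      # raises before the function body runs
            result = func(data)
            check_output(result)   # raises before the caller sees the result
            return result
        return wrapper
    return decorator


def require_keys(*keys):
    """Return a check that raises ValueError when any key is missing."""
    def check(payload):
        missing = [key for key in keys if key not in payload]
        if missing:
            raise ValueError(f"missing keys: {missing}")
    return check


@validate_with(require_keys("name", "email"), require_keys("id", "name", "email"))
def process_user(data):
    return {"id": 1, **data}


print(process_user({"name": "Alice", "email": "alice@example.com"}))

try:
    process_user({"name": "Alice"})
except ValueError as exc:
    print(f"Rejected: {exc}")
```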

Exercises

  1. Basic Contract: Create a contract for a Product with fields: id, name, price, category, and in_stock.

  2. Coercion: Add coercion rules to handle string inputs for numeric fields.

  3. Validation: Add validation rules to ensure price is positive and category is one of ["electronics", "clothing", "food"].

  4. Custom Validation: Create a custom validation that checks if a product name doesn't contain special characters.

  5. Integration: Use the validator in an ETL pipeline that filters out invalid products.

Next Steps