Data Contracts & Validation
Learn to define data contracts and validate data against them.
What You'll Learn
- Define JSON Schema-based contracts
- Add coercion rules for type conversion
- Create custom validation rules
- Use the Validator class effectively
- Handle validation errors
- Integrate validation into pipelines
Prerequisites
Part 1: Understanding Data Contracts
A data contract is a formal specification defining:
- Schema - Structure and types (JSON Schema)
- Coercion Rules - Pre-validation transformations
- Validation Rules - Post-validation business constraints
- Metadata - Description, ownership, versioning
Contract File Structure

schema:
  type: object
  version: "1.0.0"
  properties:
    id:
      type: integer
    name:
      type: string
      minLength: 1
    email:
      type: string
      format: email
    age:
      type: integer
      minimum: 0
      coercion: coerce_to_integer
      validations:
        is_positive: {}
  required:
    - id
    - name
    - email
metadata:
  title: User Contract
  description: Defines user record structure
  version: "1.0.0"
ownership:
  owner: data-team
  steward: alice@example.com
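
The order in which a contract's parts apply (coerce, then check the schema, then run validation rules) can be illustrated without the library. The following is a minimal standard-library sketch, not pycharter's implementation; `apply_contract` and the toy versions of `coerce_to_integer` and `is_positive` are hypothetical and exist only for this illustration.

```python
# Illustrative only: a toy three-stage pipeline, not pycharter's implementation.

def coerce_to_integer(value):
    """Stage 1 (coercion): turn numeric strings into ints, pass other values through."""
    if isinstance(value, str):
        try:
            return int(value)
        except ValueError:
            return value
    return value


def is_positive(value):
    """Stage 3 (validation rule): business constraint applied after the type check."""
    if value <= 0:
        raise ValueError("must be greater than 0")
    return value


def apply_contract(record, field, expected_type, coercion, rule):
    """Hypothetical helper: run coercion, type check, then the rule on one field."""
    value = coercion(record[field])            # 1. coerce
    if not isinstance(value, expected_type):   # 2. schema (type) check
        return False, f"{field}: expected {expected_type.__name__}"
    try:
        rule(value)                            # 3. validation rule
    except ValueError as exc:
        return False, f"{field}: {exc}"
    return True, value


ok, age = apply_contract({"age": "30"}, "age", int, coerce_to_integer, is_positive)
bad_type = apply_contract({"age": "abc"}, "age", int, coerce_to_integer, is_positive)
bad_rule = apply_contract({"age": "-5"}, "age", int, coerce_to_integer, is_positive)
```

The string `"30"` only passes the integer type check because coercion ran first, and `"-5"` passes the type check but is rejected by the rule that runs last.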
Part 2: Schema Definition
Basic Types

properties:
  # String
  name:
    type: string
    minLength: 1
    maxLength: 100
  # Integer
  age:
    type: integer
    minimum: 0
    maximum: 150
  # Number (float)
  price:
    type: number
    minimum: 0
    exclusiveMaximum: 10000
  # Boolean
  active:
    type: boolean
  # Array
  tags:
    type: array
    items:
      type: string
    minItems: 1
    uniqueItems: true
  # Object
  address:
    type: object
    properties:
      street:
        type: string
      city:
        type: string
    required:
      - city
String Constraints

properties:
  username:
    type: string
    minLength: 3
    maxLength: 20
    pattern: "^[a-z0-9_]+$"  # lowercase, numbers, underscore only
  email:
    type: string
    format: email
  status:
    type: string
    enum: ["active", "inactive", "pending"]
  country_code:
    type: string
    const: "US"  # Fixed value
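
JSON Schema's `pattern` keyword is not implicitly anchored, which is why the username pattern spells out `^` and `$`. One quick way to sanity-check a pattern before putting it in a contract is Python's `re` module. This sketch uses `re.search` to mirror the unanchored semantics; the helper name `matches_pattern` is made up for this example, and regex dialects can differ slightly between validators.

```python
import re

USERNAME_PATTERN = r"^[a-z0-9_]+$"   # same pattern as the username field
UNANCHORED = r"[a-z0-9_]+"           # the same character class without anchors


def matches_pattern(pattern, value):
    """Return True if the pattern matches anywhere in value (unanchored search)."""
    return re.search(pattern, value) is not None


anchored_ok = matches_pattern(USERNAME_PATTERN, "alice_01")
anchored_bad = matches_pattern(USERNAME_PATTERN, "Alice!")
unanchored_bad = matches_pattern(UNANCHORED, "Alice!")  # matches the "lice" substring
```

Without the anchors, `"Alice!"` is accepted because a valid substring exists somewhere in the value.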
Numeric Constraints

properties:
  age:
    type: integer
    minimum: 0
    maximum: 150
  price:
    type: number
    minimum: 0
    exclusiveMaximum: 10000
    multipleOf: 0.01  # Two decimal places
  quantity:
    type: integer
    minimum: 1
    default: 1
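
`multipleOf: 0.01` deserves some care: binary floating point cannot represent most decimal fractions exactly, and validators differ in how they handle that. The sketch below shows the pitfall with plain floats and a `Decimal`-based check that avoids it. `is_multiple_of` is a hypothetical helper, not part of pycharter.

```python
from decimal import Decimal


def is_multiple_of(value, step):
    """Check divisibility on the decimal text of the numbers, not their binary floats."""
    return Decimal(str(value)) % Decimal(str(step)) == 0


# Plain float arithmetic leaves a remainder even though 0.3 is "obviously" 3 x 0.1.
float_remainder = 0.3 % 0.1

two_places = is_multiple_of(19.99, 0.01)     # price with two decimal places
three_places = is_multiple_of(19.995, 0.01)  # price with three decimal places
tenths = is_multiple_of(0.3, 0.1)
```

If prices matter to the cent, it may be safer to test a few boundary values against the validator in use rather than assume how it treats floats.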
Nested Objects

properties:
  user:
    type: object
    properties:
      profile:
        type: object
        properties:
          bio:
            type: string
          avatar_url:
            type: string
            format: uri
        required:
          - bio
      settings:
        type: object
        properties:
          notifications:
            type: boolean
            default: true
    required:
      - profile
Arrays

properties:
  # Simple array
  tags:
    type: array
    items:
      type: string
    minItems: 1
    maxItems: 10
    uniqueItems: true
  # Array of objects
  addresses:
    type: array
    items:
      type: object
      properties:
        type:
          type: string
          enum: ["home", "work", "other"]
        street:
          type: string
        city:
          type: string
      required:
        - type
        - city
    minItems: 1
Part 3: Coercion Rules
Coercion transforms data before validation:
Built-in Coercions

| Coercion | Input → Output | Example |
|---|---|---|
| `coerce_to_string` | any → str | 123 → "123" |
| `coerce_to_integer` | str/float → int | "42" → 42 |
| `coerce_to_float` | str/int → float | "3.14" → 3.14 |
| `coerce_to_boolean` | str/int → bool | "true" → True |
| `coerce_to_datetime` | str → datetime | "2024-01-01" → datetime |
| `coerce_to_date` | str → date | "2024-01-01" → date |
| `coerce_to_lowercase` | str → str | "HELLO" → "hello" |
| `coerce_to_uppercase` | str → str | "hello" → "HELLO" |
| `coerce_to_stripped_string` | str → str | " hi " → "hi" |
| `coerce_to_list` | any → list | "a" → ["a"] |
| `coerce_empty_to_null` | empty → None | "" → None |
Inline Coercion
Add coercion directly in the schema:

properties:
  age:
    type: integer
    coercion: coerce_to_integer
  email:
    type: string
    coercion: coerce_to_lowercase
  tags:
    type: array
    coercion: coerce_to_list
Separate Coercion Rules

version: "1.0.0"
rules:
  age: coerce_to_integer
  price: coerce_to_float
  email: coerce_to_lowercase
  status: coerce_to_uppercase
Custom Coercion
Register custom coercion functions:

from pycharter.shared.coercions import register_coercion

def coerce_phone_number(value):
    """Remove all non-digit characters from phone number."""
    if isinstance(value, str):
        return ''.join(c for c in value if c.isdigit())
    return value

register_coercion("coerce_phone_number", coerce_phone_number)
Use in schema:
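
A plausible schema entry, assuming a registered custom coercion is referenced by name exactly like the built-in ones (the `phone` field name is made up for this example):

```yaml
properties:
  phone:
    type: string
    coercion: coerce_phone_number  # the name passed to register_coercion
```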
Part 4: Validation Rules
Validation checks data after schema validation:
Built-in Validations

| Validation | Description | Config |
|---|---|---|
| `min_length` | Minimum length | {"threshold": 3} |
| `max_length` | Maximum length | {"threshold": 100} |
| `is_positive` | Value > 0 | {} |
| `is_email` | Valid email | {} |
| `is_url` | Valid URL | {} |
| `is_alphanumeric` | Only letters/numbers | {} |
| `is_numeric_string` | Numeric string | {} |
| `matches_regex` | Match pattern | {"pattern": "..."} |
| `only_allow` | Whitelist | {"allowed_values": [...]} |
| `no_capital_characters` | No uppercase | {} |
| `no_special_characters` | No special chars | {} |
| `non_empty_string` | Not empty | {} |
| `is_unique` | Unique array items | {} |
Inline Validation
Add validations directly in the schema:

properties:
  age:
    type: integer
    validations:
      is_positive: {}
      less_than_or_equal_to:
        threshold: 150
  username:
    type: string
    validations:
      min_length:
        threshold: 3
      max_length:
        threshold: 20
      matches_regex:
        pattern: "^[a-z0-9_]+$"
  status:
    type: string
    validations:
      only_allow:
        allowed_values: ["active", "inactive", "pending"]
Separate Validation Rules

version: "1.0.0"
rules:
  age:
    is_positive: {}
    less_than_or_equal_to:
      threshold: 150
  username:
    min_length:
      threshold: 3
    no_special_characters: {}
  email:
    is_email: {}
Custom Validation
Register custom validation functions:

from pycharter.shared.validations import register_validation

def is_valid_phone(min_digits=10):
    """Validate phone number has minimum digits."""
    def _validate(value, info):
        if value is None:
            return value
        digits = ''.join(c for c in str(value) if c.isdigit())
        if len(digits) < min_digits:
            raise ValueError(f"Phone number must have at least {min_digits} digits")
        return value
    return _validate

register_validation("is_valid_phone", is_valid_phone)
Use in schema:
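
Assuming custom validations are configured like the built-in ones, with the config mapping passed to the factory as keyword arguments, a schema entry might look like this (the `phone` field name is illustrative):

```yaml
properties:
  phone:
    type: string
    validations:
      is_valid_phone:
        min_digits: 10  # assumed to map onto the factory's min_digits parameter
```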
Part 5: Using the Validator
Creating Validators

from pycharter import Validator

# From single contract file
validator = Validator.from_file("user_contract.yaml")

# From directory (schema.yaml, coercion_rules.yaml, validation_rules.yaml)
validator = Validator.from_dir("contracts/user/")

# From explicit files
validator = Validator.from_files(
    schema="schemas/user.yaml",
    coercion_rules="rules/coercion.yaml",
    validation_rules="rules/validation.yaml"
)

# From dictionaries
validator = Validator.from_dict(
    schema=schema_dict,
    coercion_rules=coercion_dict,
    validation_rules=validation_dict
)

# From metadata store
from pycharter import SQLiteMetadataStore

store = SQLiteMetadataStore("metadata.db")
store.connect()
validator = Validator(store=store, schema_id="user_schema_v1")
Single Record Validation

result = validator.validate({
    "id": "123",                   # Will be coerced to 123
    "name": "Alice",
    "email": "ALICE@EXAMPLE.COM",  # Will be lowercased
    "age": "30"                    # Will be coerced to 30
})

if result.is_valid:
    print(f"Valid data: {result.data}")
    # Access as Pydantic model
    print(f"Name: {result.data.name}")
    print(f"Email: {result.data.email}")
else:
    print(f"Validation errors: {result.errors}")
Batch Validation

records = [
    {"id": 1, "name": "Alice", "email": "alice@example.com", "age": 30},
    {"id": 2, "name": "", "email": "invalid", "age": -5},  # Invalid
    {"id": 3, "name": "Charlie", "email": "charlie@example.com", "age": 25},
]

results = validator.validate_batch(records)
valid_count = sum(1 for r in results if r.is_valid)
print(f"Valid: {valid_count}/{len(results)}")

# Process valid records
for result in results:
    if result.is_valid:
        process_record(result.data)
    else:
        log_errors(result.errors)
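
A common follow-up is to keep rejected records alongside their errors so they can be inspected or retried later. The sketch below uses a stand-in `Result` dataclass that mirrors only the `is_valid`, `data`, and `errors` attributes, so it runs without pycharter. `partition_results` is a hypothetical helper, and it assumes the batch results come back in the same order as the input records.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Result:
    """Stand-in for a validation result; only mirrors the attributes used here."""
    is_valid: bool
    data: Any = None
    errors: list = field(default_factory=list)


def partition_results(records, results):
    """Split a batch into validated data and (original record, errors) pairs."""
    valid, rejected = [], []
    for record, result in zip(records, results):
        if result.is_valid:
            valid.append(result.data)
        else:
            rejected.append((record, result.errors))
    return valid, rejected


records = [{"id": 1}, {"id": 2}]
results = [Result(True, data={"id": 1}), Result(False, errors=[{"msg": "bad"}])]
valid, rejected = partition_results(records, results)
```

Pairing each failure with its original record keeps enough context to write a dead-letter file or a rejection report.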
Strict Mode
Raise exceptions instead of returning errors:

from pydantic import ValidationError

try:
    result = validator.validate(data, strict=True)
    # If we get here, data is valid
    process(result.data)
except ValidationError as e:
    print(f"Validation failed: {e}")
Getting the Model
Access the generated Pydantic model:

# Get the model class
UserModel = validator.get_model()

# Use it directly
user = UserModel(id=1, name="Alice", email="alice@example.com", age=30)
print(user.model_dump())

# Export schema
print(UserModel.model_json_schema())
Part 6: Handling Errors
ValidationResult Structure

result = validator.validate(data)

# Check validity
if result.is_valid:
    # Access validated data (Pydantic model instance)
    validated = result.data
    print(validated.name)
else:
    # Access errors
    for error in result.errors:
        print(f"Field: {error['loc']}")
        print(f"Message: {error['msg']}")
        print(f"Type: {error['type']}")
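
For logs or API responses it is often handy to flatten each error into one line. The following is a small sketch that relies only on the `loc`, `msg`, and `type` keys; `format_errors` is a hypothetical helper, and the sample error dictionaries are invented for the example rather than captured from a real run.

```python
def format_errors(errors):
    """Turn error dicts with 'loc', 'msg', and 'type' keys into readable lines."""
    lines = []
    for error in errors:
        # 'loc' is a sequence such as ("address", "city") or ("tags", 0)
        path = ".".join(str(part) for part in error["loc"]) or "<root>"
        lines.append(f"{path}: {error['msg']} [{error['type']}]")
    return lines


# Invented sample errors, shaped like the dictionaries printed in the loop.
sample_errors = [
    {"loc": ("name",), "msg": "String should have at least 1 character", "type": "string_too_short"},
    {"loc": ("address", "city"), "msg": "Field required", "type": "missing"},
]
formatted = format_errors(sample_errors)
```

Joining `loc` with dots gives nested fields a readable path such as `address.city`.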
Error Types

| Error Type | Description |
|---|---|
| `string_too_short` | String below minLength |
| `string_too_long` | String above maxLength |
| `string_pattern_mismatch` | Pattern not matched |
| `missing` | Required field missing |
| `int_parsing` | Cannot parse as integer |
| `value_error` | Custom validation failed |
| `enum` | Value not in enum |
Custom Error Messages

from pycharter.shared.validations import register_validation

def is_adult(min_age=18):
    def _validate(value, info):
        if value < min_age:
            raise ValueError(f"Must be at least {min_age} years old")
        return value
    return _validate

register_validation("is_adult", is_adult)
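
Because the factory returns a plain function, the error message can be checked in a unit test without loading a contract. This sketch repeats the definition so that it runs on its own and leaves out the `register_validation` call. Passing `None` for `info` is an assumption that works only because this validator never reads it.

```python
def is_adult(min_age=18):
    def _validate(value, info):
        if value < min_age:
            raise ValueError(f"Must be at least {min_age} years old")
        return value
    return _validate


check = is_adult(min_age=21)   # build the validator once with its config
passed = check(30, None)       # valid values are returned unchanged

try:
    check(19, None)
    message = None
except ValueError as exc:
    message = str(exc)         # the text that would surface as the error message
```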
Part 7: Integration Patterns
With ETL Pipelines

from pycharter import Pipeline, HTTPExtractor, FileLoader, CustomFunction, Validator

validator = Validator.from_file("contracts/user.yaml")

def validate_records(records):
    """Filter to only valid records."""
    valid = []
    for record in records:
        result = validator.validate(record)
        if result.is_valid:
            valid.append(result.data.model_dump())
    return valid

pipeline = (
    Pipeline(HTTPExtractor(url="https://api.example.com/users"))
    | CustomFunction(validate_records)
    | FileLoader(path="output/valid_users.json")
)
With FastAPI

from fastapi import FastAPI, HTTPException
from pycharter import Validator

app = FastAPI()
validator = Validator.from_file("contracts/user.yaml")

@app.post("/users")
async def create_user(data: dict):
    result = validator.validate(data)
    if not result.is_valid:
        raise HTTPException(status_code=422, detail=result.errors)
    # Process valid data (save_user is your own persistence function)
    return {"id": save_user(result.data)}
Decorator Pattern

from datetime import datetime

from pycharter import validate_input, validate_output

@validate_input("contracts/user_input.yaml")
@validate_output("contracts/user_output.yaml")
def process_user(data: dict) -> dict:
    # Input is already validated
    # Output will be validated before return
    return {
        "id": generate_id(),  # generate_id is your own ID helper
        "name": data["name"],
        "email": data["email"],
        "created_at": datetime.now().isoformat()
    }
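
To make the mechanics of the pattern concrete, here is a standard-library sketch of how an input/output validating decorator can be built. It is not how pycharter implements `validate_input` or `validate_output`; `validate_with` and `require_keys` are hypothetical, and the toy key check stands in for real contract validation.

```python
from functools import wraps


def validate_with(check_input, check_output):
    """Hypothetical decorator factory: run a check before and after the function."""
    def decorator(func):
        @wraps(func)
        def wrapper(data):
            check_input(data)       # raises if the input is invalid
            result = func(data)
            check_output(result)    # raises if the output is invalid
            return result
        return wrapper
    return decorator


def require_keys(*keys):
    """Toy check standing in for contract validation."""
    def check(payload):
        missing = [k for k in keys if k not in payload]
        if missing:
            raise ValueError(f"missing keys: {missing}")
    return check


@validate_with(require_keys("name", "email"), require_keys("id", "name"))
def process_user(data):
    return {"id": 1, "name": data["name"]}


created = process_user({"name": "Alice", "email": "alice@example.com"})
```

Using `functools.wraps` keeps the wrapped function's name and docstring intact, which helps with logging and debugging.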
Exercises
- Basic Contract: Create a contract for a `Product` with fields: id, name, price, category, and in_stock.
- Coercion: Add coercion rules to handle string inputs for numeric fields.
- Validation: Add validation rules to ensure price is positive and category is one of ["electronics", "clothing", "food"].
- Custom Validation: Create a custom validation that checks if a product name doesn't contain special characters.
- Integration: Use the validator in an ETL pipeline that filters out invalid products.
Next Steps
- Data Quality Monitoring - Monitor validation metrics
- Metadata Store - Store and version your contracts
- API Reference: Validator - Complete Validator documentation