Metadata-Version: 2.4
Name: lakelogic
Version: 0.2.0
Summary: A Python-based data contract runtime for consistent quality across engines.
Author-email: LakeLogic Team <lakelogic@gmail.com>
License: Apache-2.0
License-File: LICENSE
Requires-Python: >=3.9
Requires-Dist: httpx>=0.27.0
Requires-Dist: loguru>=0.7.0
Requires-Dist: polars>=0.20.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: sqlglot>=20.0.0
Requires-Dist: typer>=0.9.0
Provides-Extra: all
Requires-Dist: apprise>=1.7.0; extra == 'all'
Requires-Dist: azure-eventgrid>=4.17.0; extra == 'all'
Requires-Dist: azure-identity>=1.15.0; extra == 'all'
Requires-Dist: azure-keyvault-secrets>=4.7.0; extra == 'all'
Requires-Dist: azure-servicebus>=7.11.0; extra == 'all'
Requires-Dist: azure-storage-blob>=12.19.0; extra == 'all'
Requires-Dist: boto3>=1.28.0; extra == 'all'
Requires-Dist: bytewax>=0.19.0; extra == 'all'
Requires-Dist: cryptography>=41.0.0; extra == 'all'
Requires-Dist: databricks-sdk>=0.18.0; extra == 'all'
Requires-Dist: dataprofiler>=0.9.0; extra == 'all'
Requires-Dist: deltalake>=0.15.0; extra == 'all'
Requires-Dist: duckdb>=0.9.0; extra == 'all'
Requires-Dist: google-cloud-bigquery>=3.11.0; extra == 'all'
Requires-Dist: google-cloud-pubsub>=2.18.0; extra == 'all'
Requires-Dist: google-cloud-secret-manager>=2.16.0; extra == 'all'
Requires-Dist: google-cloud-storage>=2.10.0; extra == 'all'
Requires-Dist: hvac>=2.0.0; extra == 'all'
Requires-Dist: jinja2>=3.1.0; extra == 'all'
Requires-Dist: kafka-python>=2.0.2; extra == 'all'
Requires-Dist: lxml>=4.9.0; extra == 'all'
Requires-Dist: nbclient>=0.9.0; extra == 'all'
Requires-Dist: nbformat>=5.9.0; extra == 'all'
Requires-Dist: openpyxl>=3.1.0; extra == 'all'
Requires-Dist: pandas>=2.0.0; extra == 'all'
Requires-Dist: paramiko>=3.4.0; extra == 'all'
Requires-Dist: polars>=0.20.0; extra == 'all'
Requires-Dist: presidio-analyzer>=2.2.0; extra == 'all'
Requires-Dist: psycopg2-binary>=2.9.0; extra == 'all'
Requires-Dist: pyarrow>=14.0.0; extra == 'all'
Requires-Dist: pymongo>=4.6.0; extra == 'all'
Requires-Dist: pymysql>=1.1.0; extra == 'all'
Requires-Dist: pyodbc>=5.0.0; extra == 'all'
Requires-Dist: pyspark>=3.3.0; extra == 'all'
Requires-Dist: requests>=2.31.0; extra == 'all'
Requires-Dist: snowflake-connector-python>=3.5.0; extra == 'all'
Requires-Dist: sseclient-py>=1.8.0; extra == 'all'
Requires-Dist: websocket-client>=1.6.0; extra == 'all'
Provides-Extra: api
Requires-Dist: requests>=2.31.0; extra == 'api'
Provides-Extra: aws-messaging
Requires-Dist: boto3>=1.28.0; extra == 'aws-messaging'
Provides-Extra: azure-messaging
Requires-Dist: azure-eventgrid>=4.17.0; extra == 'azure-messaging'
Requires-Dist: azure-identity>=1.15.0; extra == 'azure-messaging'
Requires-Dist: azure-servicebus>=7.11.0; extra == 'azure-messaging'
Provides-Extra: azuresql
Requires-Dist: azure-identity>=1.15.0; extra == 'azuresql'
Requires-Dist: pyodbc>=5.0.0; extra == 'azuresql'
Provides-Extra: bigquery
Requires-Dist: google-cloud-bigquery>=3.11.0; extra == 'bigquery'
Provides-Extra: bytewax
Requires-Dist: bytewax>=0.19.0; extra == 'bytewax'
Provides-Extra: databases
Requires-Dist: azure-identity>=1.15.0; extra == 'databases'
Requires-Dist: psycopg2-binary>=2.9.0; extra == 'databases'
Requires-Dist: pymongo>=4.6.0; extra == 'databases'
Requires-Dist: pymysql>=1.1.0; extra == 'databases'
Requires-Dist: pyodbc>=5.0.0; extra == 'databases'
Provides-Extra: delta
Requires-Dist: azure-identity>=1.15.0; extra == 'delta'
Requires-Dist: azure-storage-blob>=12.19.0; extra == 'delta'
Requires-Dist: boto3>=1.28.0; extra == 'delta'
Requires-Dist: databricks-sdk>=0.18.0; extra == 'delta'
Requires-Dist: deltalake>=0.15.0; extra == 'delta'
Requires-Dist: google-cloud-storage>=2.10.0; extra == 'delta'
Provides-Extra: dev
Requires-Dist: black>=23.0.0; extra == 'dev'
Requires-Dist: hypothesis>=6.100.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.0.0; extra == 'dev'
Requires-Dist: pytest>=7.0.0; extra == 'dev'
Requires-Dist: ruff>=0.1.0; extra == 'dev'
Provides-Extra: docs
Requires-Dist: mkdocs-jupyter>=0.24.0; extra == 'docs'
Requires-Dist: mkdocs-material>=9.0.0; extra == 'docs'
Requires-Dist: mkdocstrings[python]>=0.20.0; extra == 'docs'
Provides-Extra: duckdb
Requires-Dist: deltalake>=0.15.0; extra == 'duckdb'
Requires-Dist: duckdb>=0.9.0; extra == 'duckdb'
Requires-Dist: lxml>=4.9.0; extra == 'duckdb'
Requires-Dist: openpyxl>=3.1.0; extra == 'duckdb'
Requires-Dist: pandas>=2.0.0; extra == 'duckdb'
Requires-Dist: pyarrow>=14.0.0; extra == 'duckdb'
Provides-Extra: gcp-messaging
Requires-Dist: google-cloud-pubsub>=2.18.0; extra == 'gcp-messaging'
Provides-Extra: integrations
Requires-Dist: azure-eventgrid>=4.17.0; extra == 'integrations'
Requires-Dist: azure-identity>=1.15.0; extra == 'integrations'
Requires-Dist: azure-servicebus>=7.11.0; extra == 'integrations'
Requires-Dist: boto3>=1.28.0; extra == 'integrations'
Requires-Dist: google-cloud-pubsub>=2.18.0; extra == 'integrations'
Requires-Dist: paramiko>=3.4.0; extra == 'integrations'
Requires-Dist: requests>=2.31.0; extra == 'integrations'
Provides-Extra: kafka
Requires-Dist: kafka-python>=2.0.2; extra == 'kafka'
Provides-Extra: mongodb
Requires-Dist: pymongo>=4.6.0; extra == 'mongodb'
Provides-Extra: mysql
Requires-Dist: pymysql>=1.1.0; extra == 'mysql'
Provides-Extra: notebook
Requires-Dist: nbclient>=0.9.0; extra == 'notebook'
Requires-Dist: nbformat>=5.9.0; extra == 'notebook'
Provides-Extra: notifications
Requires-Dist: apprise>=1.7.0; extra == 'notifications'
Requires-Dist: azure-identity>=1.15.0; extra == 'notifications'
Requires-Dist: azure-keyvault-secrets>=4.7.0; extra == 'notifications'
Requires-Dist: boto3>=1.28.0; extra == 'notifications'
Requires-Dist: cryptography>=41.0.0; extra == 'notifications'
Requires-Dist: google-cloud-secret-manager>=2.16.0; extra == 'notifications'
Requires-Dist: hvac>=2.0.0; extra == 'notifications'
Requires-Dist: jinja2>=3.1.0; extra == 'notifications'
Provides-Extra: pandas
Requires-Dist: deltalake>=0.15.0; extra == 'pandas'
Requires-Dist: duckdb>=0.9.0; extra == 'pandas'
Requires-Dist: lxml>=4.9.0; extra == 'pandas'
Requires-Dist: openpyxl>=3.1.0; extra == 'pandas'
Requires-Dist: pandas>=2.0.0; extra == 'pandas'
Provides-Extra: pathway
Requires-Dist: pathway>=0.7.0; (python_version >= '3.10') and extra == 'pathway'
Provides-Extra: polars
Requires-Dist: deltalake>=0.15.0; extra == 'polars'
Requires-Dist: lxml>=4.9.0; extra == 'polars'
Requires-Dist: openpyxl>=3.1.0; extra == 'polars'
Requires-Dist: polars>=0.20.0; extra == 'polars'
Provides-Extra: postgresql
Requires-Dist: azure-identity>=1.15.0; extra == 'postgresql'
Requires-Dist: psycopg2-binary>=2.9.0; extra == 'postgresql'
Provides-Extra: profiling
Requires-Dist: dataprofiler>=0.9.0; extra == 'profiling'
Requires-Dist: presidio-analyzer>=2.2.0; extra == 'profiling'
Provides-Extra: sftp
Requires-Dist: paramiko>=3.4.0; extra == 'sftp'
Provides-Extra: snowflake
Requires-Dist: snowflake-connector-python>=3.5.0; extra == 'snowflake'
Provides-Extra: spark
Requires-Dist: pyspark>=3.3.0; extra == 'spark'
Provides-Extra: sse
Requires-Dist: sseclient-py>=1.8.0; extra == 'sse'
Provides-Extra: streaming
Requires-Dist: bytewax>=0.19.0; extra == 'streaming'
Requires-Dist: kafka-python>=2.0.2; extra == 'streaming'
Requires-Dist: pathway>=0.7.0; (python_version >= '3.10') and extra == 'streaming'
Requires-Dist: sseclient-py>=1.8.0; extra == 'streaming'
Requires-Dist: websocket-client>=1.6.0; extra == 'streaming'
Provides-Extra: websocket
Requires-Dist: websocket-client>=1.6.0; extra == 'websocket'
Description-Content-Type: text/markdown

# LakeLogic

**Trust Your Data. Scale Your Logic.**

*Write Once. Run Anywhere.* — The open-source runtime for data contracts with quarantine.

LakeLogic is a SQL-first, infrastructure-agnostic quality gate that ensures your business decisions are based on data you can trust. It scales your validation logic from local Polars to petabyte-scale Spark without rewriting a single rule.

[![Documentation](https://img.shields.io/badge/docs-GitHub%20Pages-blue)](https://lakelogic.github.io/LakeLogic/)
[![GitHub](https://img.shields.io/badge/GitHub-Repository-black?logo=github)](https://github.com/LakeLogic/LakeLogic)
[![License](https://img.shields.io/badge/license-Apache%202.0-green)](LICENSE)
[![Python](https://img.shields.io/badge/python-3.9+-blue?logo=python)](https://www.python.org)
[![Version](https://img.shields.io/badge/version-0.2.0-orange)](CHANGELOG.md)

---

## The Core Value: Write Once. Run Anywhere

Stop paying the **"Infrastructure Lock-In Tax."** In a traditional stack, moving from a Warehouse (Snowflake) to a Lakehouse (Databricks) means months of rewriting validation rules. LakeLogic decouples your **Business Logic** from your **Execution Engine**.

1. **Cost Efficiency (The Spark Tax ROI):** Run 80% of your maintenance checks on **Polars** or **DuckDB** for pennies, and reserve **Spark** for your largest production workloads.
2. **Risk Mitigation (100% Reconciliation):** Ensure `Source = Good + Quarantined`. Mathematically prove that no record was lost or double-counted across your layers.
3. **Stakeholder Trust (Visual Traceability):** Use aggregate roll-ups to give your business users a visual drill-down from board-level KPIs back to raw source records.
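
The reconciliation guarantee in point 2 can be illustrated without LakeLogic at all. The sketch below is plain Python; `split_records` and `reconciles` are hypothetical helpers written for this README, not part of the LakeLogic API, and the rule is made up. It only shows what `Source = Good + Quarantined` means: every source record lands in exactly one bucket.

```python
from typing import Callable


def split_records(
    records: list[dict], is_valid: Callable[[dict], bool]
) -> tuple[list[dict], list[dict]]:
    """Route each record to exactly one of two buckets (hypothetical helper)."""
    good, quarantined = [], []
    for record in records:
        (good if is_valid(record) else quarantined).append(record)
    return good, quarantined


def reconciles(source: list[dict], good: list[dict], quarantined: list[dict], key: str = "id") -> bool:
    """True when the routed keys match the source keys, with no loss or duplication."""
    source_keys = sorted(r[key] for r in source)
    routed_keys = sorted(r[key] for r in good + quarantined)
    return source_keys == routed_keys


source = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": 3, "email": "c@example.com"},
]
good, quarantined = split_records(source, lambda r: r["email"] is not None)
assert reconciles(source, good, quarantined)
```

Comparing sorted key lists rather than bare counts also catches a record that was dropped from one bucket and duplicated in the other.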

## Key Features

- **SQL-First Logic**: Use the SQL expressions you already know for transformations and quality rules.
- **Schema Enforcement**: Type casting, required fields, and unknown-field handling.
- **Intelligent Quarantine**: Records that fail rules are detoured, tagged with error messages, and saved for correction.
- **Lineage Injection**: Tag records with source path, run ID, and processing timestamp.
- **Materialization**: Write validated data to local CSV/Parquet targets or Delta/Iceberg when running on Spark.
- **Referential Integrity**: Validate keys against dimensions using local reference tables.
- **Contract Inference**: Auto-generate contracts from landing-zone files with `lakelogic bootstrap`.
- **dbt Import**: Convert dbt `schema.yml` / `sources.yml` into LakeLogic contracts with `lakelogic import-dbt`.
- **Synthetic Data Generation**: Generate realistic test data from any contract with `DataGenerator`.
- **External Logic Hooks**: Run dedicated Python modules or notebooks for advanced Gold processing.
- **Policy Packs**: Apply standardised rule sets and defaults across all contracts.
- **Notifications**: Built-in adapters log alerts for quarantine and rule failures.
- **Observability**: Prometheus metrics endpoint, summary tables, and execution tracing.
- **Delta Lake Support (Spark-Free)**: Read/write/merge Delta tables with Polars, DuckDB, or Pandas — no Spark required.
- **Catalog Table Names**: Use Unity Catalog, Fabric LakeDB, and Synapse table names (`catalog.schema.table`) directly.
- **Streaming Ingestion**: Kafka, WebSocket, SSE, Azure Service Bus, GCP Pub/Sub, AWS SQS.
- **Database CDC**: Azure SQL, PostgreSQL, MySQL, MongoDB, Oracle, SQL Server change capture.
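
To make the quarantine and lineage features above concrete, here is a minimal sketch of the idea in plain Python. It is not LakeLogic's implementation: the `RULES` mapping, the `process` function, and the underscore-prefixed column names are all assumptions chosen for illustration, and LakeLogic expresses rules as SQL rather than Python callables.

```python
import uuid
from datetime import datetime, timezone

# Hypothetical rules: rule name -> predicate over one record.
RULES = {
    "email_required": lambda r: bool(r.get("email")),
    "age_non_negative": lambda r: r.get("age") is not None and r["age"] >= 0,
}


def process(records, source_path):
    """Tag every record with lineage; send rule failures to the bad bucket."""
    run_id = str(uuid.uuid4())
    processed_at = datetime.now(timezone.utc).isoformat()
    good, bad = [], []
    for record in records:
        tagged = {
            **record,
            "_source_path": source_path,    # assumed column name
            "_run_id": run_id,              # assumed column name
            "_processed_at": processed_at,  # assumed column name
        }
        failures = [name for name, check in RULES.items() if not check(record)]
        if failures:
            # Quarantined rows keep their data plus the names of the failed rules.
            bad.append({**tagged, "_errors": failures})
        else:
            good.append(tagged)
    return good, bad


good, bad = process(
    [{"email": "a@example.com", "age": 31}, {"email": "", "age": -4}],
    source_path="bronze_crm_customers.csv",
)
```

Because quarantined rows carry both their original values and the failed rule names, they can be corrected and replayed later without going back to the source file.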

## Installation

```bash
# Get the full engine suite
uv pip install "lakelogic[all]"

# Or just use Polars for local speed
uv pip install "lakelogic[polars]"

# Delta Lake support (Spark-free)
uv pip install "lakelogic[delta]"

# Profiling + PII detection (bootstrap)
uv pip install "lakelogic[profiling]"

# Database CDC connectors
uv pip install "lakelogic[databases]"

# Streaming sources
uv pip install "lakelogic[streaming]"
```

See the full installation guide in `docs/installation.md`.
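
If you are unsure which engine extras ended up in your environment, a quick check like the one below can help. This is a standalone sketch using only the standard library; `ENGINE_MODULES` and `available_engines` are hypothetical names, and the snippet does not describe how LakeLogic itself selects an engine. The module names are the import names of the packages listed under the matching extras.

```python
import importlib.util

# Extra name -> import name of the package that extra installs.
ENGINE_MODULES = {
    "polars": "polars",
    "duckdb": "duckdb",
    "pandas": "pandas",
    "spark": "pyspark",
}


def available_engines() -> list[str]:
    """Return the engines whose package can be imported in this environment."""
    return [
        engine
        for engine, module in ENGINE_MODULES.items()
        if importlib.util.find_spec(module) is not None
    ]


print("Installed engines:", available_engines() or "none")
```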

## Quick Start

```python
from lakelogic import DataProcessor

# 1. Run the Quality Gate (Automatic Engine Selection)
processor = DataProcessor(contract="silver_crm_customers.yaml")
good_df, bad_df = processor.run_source()

# good_df -> Ready for Silver Layer
# bad_df  -> Sent to Quarantine
```

`run_source()` automatically reads the source path from your contract. You can also pass an explicit path:

```python
good_df, bad_df = processor.run_source("bronze_crm_customers.csv")
```

The return value is a `ValidationResult` that unpacks as two DataFrames. Access the raw (pre-validation) frame via `result.raw`:

```python
result = processor.run_source()
print(f"Total: {len(result.raw)} | Valid: {len(result.good)} | Quarantined: {len(result.bad)}")
```
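
One way an object can unpack as two values while still exposing a third attribute is to define `__iter__`. The sketch below illustrates that pattern with a hypothetical `ResultSketch` class and plain lists in place of DataFrames; it is an assumption about the general technique, not the actual `ValidationResult` source.

```python
from dataclasses import dataclass
from typing import Any, Iterator


@dataclass
class ResultSketch:
    """Hypothetical stand-in for ValidationResult; lists replace DataFrames."""

    raw: Any
    good: Any
    bad: Any

    def __iter__(self) -> Iterator[Any]:
        # Only good and bad take part in unpacking; raw stays an attribute.
        yield self.good
        yield self.bad


result = ResultSketch(raw=[1, 2, 3], good=[1, 3], bad=[2])
good_rows, bad_rows = result  # two-value unpacking
assert len(result.raw) == len(good_rows) + len(bad_rows)
```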

## Delta Lake & Catalog Support (Spark-Free!)

### **Unity Catalog (Databricks)**

```python
from lakelogic import DataProcessor

# Use Unity Catalog table names directly (no Spark required!)
processor = DataProcessor(engine="polars", contract="contracts/customers.yaml")
good_df, bad_df = processor.run_source("main.default.customers")

# LakeLogic automatically:
# 1. Resolves table name to storage path
# 2. Uses Delta-RS for fast, Spark-free operations
# 3. Validates data with your contract rules
```

### **Fabric LakeDB (Microsoft)**

```python
processor = DataProcessor(engine="polars", contract="contracts/sales.yaml")
good_df, bad_df = processor.run_source("myworkspace.sales_lakehouse.customers")
```

### **Synapse Analytics (Azure)**

```python
processor = DataProcessor(engine="polars", contract="contracts/sales.yaml")
good_df, bad_df = processor.run_source("salesdb.dbo.customers")
```
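
All three examples pass a three-part `catalog.schema.table` name instead of a file path. The first step of handling such a name, splitting it into its parts, might look like the sketch below. `TableName` and `parse_table_name` are hypothetical and not part of LakeLogic; quoted identifiers containing dots are not handled, and the catalog lookup that maps a name to a storage path is left out because it depends on each platform's API.

```python
from typing import NamedTuple


class TableName(NamedTuple):
    catalog: str
    schema: str
    table: str


def parse_table_name(name: str) -> TableName:
    """Split 'catalog.schema.table' into parts, rejecting anything else."""
    parts = name.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected catalog.schema.table, got {name!r}")
    return TableName(*parts)


name = parse_table_name("main.default.customers")
print(name.catalog, name.schema, name.table)
```

A file name such as `bronze_crm_customers.csv` has only two dot-separated parts, so this parser rejects it with a `ValueError`.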

**Learn more:** [Delta Lake Support](docs/delta_lake_support.md) | [Catalog Table Names](docs/catalog_table_names.md)

## dbt Integration

Import existing dbt projects directly — no rewrite needed:

```bash
# Convert a dbt model to a LakeLogic contract
lakelogic import-dbt --schema models/schema.yml --model customers --output contracts/
```

Or use the Python API:

```python
from lakelogic import DataProcessor

proc = DataProcessor.from_dbt("models/schema.yml", model="customers")
good_df, bad_df = proc.run_source()
```
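
Conceptually, an import like this has to translate dbt's column tests into row-level rules. The sketch below shows one way that mapping could look for two of dbt's built-in generic tests, `not_null` and `accepted_values`. The `dbt_column_to_rules` function and the SQL strings it emits are assumptions for illustration; they are not LakeLogic's converter or its contract format. The input dict mirrors the classic `tests:` layout of a parsed `schema.yml` column entry.

```python
def dbt_column_to_rules(column: dict) -> list[str]:
    """Translate dbt generic tests on one column into SQL-style predicates."""
    name = column["name"]
    rules = []
    for test in column.get("tests", []):
        if test == "not_null":
            rules.append(f"{name} IS NOT NULL")
        elif isinstance(test, dict) and "accepted_values" in test:
            values = ", ".join(f"'{v}'" for v in test["accepted_values"]["values"])
            rules.append(f"{name} IN ({values})")
        # Tests such as unique or relationships need dataset-level logic
        # and are skipped in this sketch.
    return rules


# Shape of one entry under models[].columns in a parsed dbt schema.yml.
status_column = {
    "name": "status",
    "tests": ["not_null", {"accepted_values": {"values": ["active", "churned"]}}],
}
rules = dbt_column_to_rules(status_column)
print(rules)
```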

## Get Started

**[📚 Read the Docs](https://LakeLogic.github.io/LakeLogic)** | **[🚀 Quickstart Guide](https://LakeLogic.github.io/LakeLogic/quickstart/)** | **[💬 Discussions](https://github.com/LakeLogic/LakeLogic/discussions)**

### Run Your First Contract (5 Minutes)

```bash
# Clone the repo
git clone https://github.com/LakeLogic/LakeLogic.git
cd LakeLogic/examples/01_quickstart

# Run the example
lakelogic run --contract users_contract.yaml --source data/sample_customers.csv
```

You'll see:
- ✅ Good records that passed validation
- ❌ Quarantined records with error reasons
- 📊 Quality metrics and health scores

## Explore the Examples

The [`examples/`](https://github.com/LakeLogic/LakeLogic/tree/main/examples) directory contains 24 runnable notebooks across 4 tested categories:

| Category | Directory | What You'll Learn |
|---|---|---|
| **Quickstart** | `01_quickstart/` | Your first contract in 5 minutes, database governance, dbt+PII |
| **Core Patterns** | `02_core_patterns/` | Medallion architecture, bronze quality gates, SCD2, deduplication, reference joins, soft deletes |
| **Advanced Workflows** | `03_advanced_workflows/` | Insurance ELT pipeline, GDPR compliance, late-arriving data, external Python logic, environment promotion, bootstrap, date dimensions, multi-tenant isolation, partitioned merge, payments lifecycle, streaming, synthetic data generation |
| **Compliance** | `04_compliance_governance/` | HIPAA PII masking |

> **Looking for more?** Additional examples for data sources, cloud platforms, orchestration, and production patterns are in `examples/_archive/`. These are functional but not yet fully tested.

## Documentation

- **[Full Documentation](https://LakeLogic.github.io/LakeLogic)** — Complete guides and API reference
- **[How It Works](https://LakeLogic.github.io/LakeLogic/concepts/)** — Medallion architecture and core concepts
- **[CLI Reference](https://LakeLogic.github.io/LakeLogic/cli/)** — Command-line usage
- **[API Reference](https://LakeLogic.github.io/LakeLogic/api/)** — Python API documentation
- **[Reprocessing Guide](https://LakeLogic.github.io/LakeLogic/reprocessing/)** — Handle late-arriving data
- **[Contract Template](docs/contract_template.md)** — Full YAML reference for all contract fields
- **[Streaming](docs/streaming_implementation_complete.md)** — Real-time ingestion guide

## Contributing

See `CONTRIBUTING.md` to get started, or `docs/installation.md#developer-installation` for environment setup.

---

### License

Apache-2.0
