Metadata-Version: 2.4
Name: scduck
Version: 0.1.1
Summary: SCD Type 2 tables with DuckDB. Track historical changes to slowly-changing data.
Project-URL: Homepage, https://github.com/wolferesearch/scduck
Project-URL: Repository, https://github.com/wolferesearch/scduck
Author: papasaidfine
License: MIT
Keywords: data-warehouse,duckdb,history,scd,slowly-changing-dimension,temporal
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Database
Requires-Python: >=3.10
Requires-Dist: duckdb>=0.9.0
Requires-Dist: pyarrow>=14.0.0
Provides-Extra: all
Requires-Dist: pandas>=2.0.0; extra == 'all'
Requires-Dist: polars>=0.19.0; extra == 'all'
Provides-Extra: dev
Requires-Dist: pandas>=2.0.0; extra == 'dev'
Requires-Dist: polars>=0.19.0; extra == 'dev'
Requires-Dist: pytest>=7.0.0; extra == 'dev'
Provides-Extra: pandas
Requires-Dist: pandas>=2.0.0; extra == 'pandas'
Provides-Extra: polars
Requires-Dist: polars>=0.19.0; extra == 'polars'
Description-Content-Type: text/markdown

# scduck

Store a time series of snapshots in an SCD Type 2 table.

**13 days of data: 65 MB CSV -> 6.3 MB DuckDB (~10x compression)**

## How it works

Each record carries a `valid_from` / `valid_to` date range. A row that is unchanged between snapshots is not rewritten; only changes generate new records.

```
id   | name   | price | valid_from | valid_to
P001 | Widget | 9.99  | 2025-01-01 | 2025-03-15  # original price
P001 | Widget | 12.99 | 2025-03-15 | NULL        # price changed
P002 | Gadget | 4.99  | 2025-01-01 | NULL        # unchanged
```

- `valid_from`: inclusive (>=)
- `valid_to`: exclusive (<), NULL = current
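
The inclusive/exclusive rule can be illustrated with a small pure-Python sketch. It is not scduck's implementation: the `ROWS` list and the `as_of` helper are hypothetical names used here only to show which rows of the example table are valid on a given day.

```python
from datetime import date

# Rows copied from the example table above; valid_to=None stands for NULL ("current").
ROWS = [
    {"id": "P001", "name": "Widget", "price": 9.99,
     "valid_from": date(2025, 1, 1), "valid_to": date(2025, 3, 15)},
    {"id": "P001", "name": "Widget", "price": 12.99,
     "valid_from": date(2025, 3, 15), "valid_to": None},
    {"id": "P002", "name": "Gadget", "price": 4.99,
     "valid_from": date(2025, 1, 1), "valid_to": None},
]


def as_of(rows, day):
    """Hypothetical helper: rows with valid_from <= day < valid_to (or open-ended)."""
    return [
        r for r in rows
        if r["valid_from"] <= day and (r["valid_to"] is None or day < r["valid_to"])
    ]


# The day before the change still sees the old price; the change date sees the new one.
prices_mar14 = {r["id"]: r["price"] for r in as_of(ROWS, date(2025, 3, 14))}
prices_mar15 = {r["id"]: r["price"] for r in as_of(ROWS, date(2025, 3, 15))}
```

Because `valid_to` is exclusive, the two `P001` rows never overlap: exactly one of them matches any given date on or after 2025-01-01.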

## Usage

```python
from scduck import SCDTable

# Define your schema
with SCDTable(
    "products.duckdb",
    table="products",
    keys=["product_id"],
    values=["name", "price", "category"]
) as db:
    # Sync daily snapshots (pandas, polars, or pyarrow);
    # df_jan1 and df_jan2 hold the full snapshot for each date
    result = db.sync("2025-01-01", df_jan1)  # returns SyncResult
    db.sync("2025-01-02", df_jan2)

    # Reconstruct any historical snapshot
    snapshot = db.get_data("2025-01-01")  # returns pyarrow Table

    # Check synced dates
    db.get_synced_dates()  # ['2025-01-01', '2025-01-02']
```
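
To make concrete what syncing consecutive snapshots involves, here is a minimal pure-Python sketch of SCD Type 2 change detection. It is an illustration under stated assumptions, not scduck's code: `apply_snapshot` is a hypothetical helper, it only handles snapshots arriving in chronological order, and closing the row of a key that disappears from a snapshot is an assumption about delete handling that this README does not specify.

```python
from datetime import date


def apply_snapshot(history, day, snapshot, keys, values):
    """Hypothetical helper: apply one snapshot to an SCD Type 2 history in place.

    `history` and `snapshot` are lists of dicts. Assumes in-order syncs only.
    """
    # Index the currently open rows (valid_to is None) by their key tuple.
    open_rows = {tuple(r[k] for k in keys): r
                 for r in history if r["valid_to"] is None}
    seen = set()
    for rec in snapshot:
        key = tuple(rec[k] for k in keys)
        seen.add(key)
        current = open_rows.get(key)
        if current is not None and all(current[v] == rec[v] for v in values):
            continue  # unchanged: write nothing
        if current is not None:
            current["valid_to"] = day  # close the old version (exclusive end)
        new_row = {c: rec[c] for c in keys + values}
        new_row.update(valid_from=day, valid_to=None)
        history.append(new_row)
    # Assumption: keys missing from the snapshot are treated as deleted.
    for key, row in open_rows.items():
        if key not in seen:
            row["valid_to"] = day
    return history


history = []
keys, values = ["product_id"], ["name", "price"]
apply_snapshot(history, date(2025, 1, 1), [
    {"product_id": "P001", "name": "Widget", "price": 9.99},
    {"product_id": "P002", "name": "Gadget", "price": 4.99},
], keys, values)
apply_snapshot(history, date(2025, 1, 2), [
    {"product_id": "P001", "name": "Widget", "price": 12.99},  # price changed
    {"product_id": "P002", "name": "Gadget", "price": 4.99},   # unchanged
], keys, values)
```

Two snapshots of two products produce three history rows rather than four, which is where the storage savings of the SCD Type 2 layout come from when most rows are stable.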

### Out-of-order sync

Dates can be synced in any order:

```python
db.sync("2025-01-15", df_jan15)  # sync Jan 15 first
db.sync("2025-01-01", df_jan1)   # backfill Jan 1 afterwards
db.get_data("2025-01-01")        # returns the correct Jan 1 snapshot
```

## Example: SecurityMaster

```python
import pandas as pd
from scduck import SCDTable

with SCDTable(
    "security_master.duckdb",
    table="securities",
    keys=["security_id"],
    values=["ticker", "mic", "isin", "description",
            "sub_industry", "country", "currency", "country_risk"]
) as db:
    df = pd.read_csv("SecurityMaster_20251201.csv")
    db.sync("2025-12-01", df)
```

## Installation

```bash
pip install scduck

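# With pandas and polars support (quoted so shells such as zsh don't expand the brackets)
pip install "scduck[all]"

# Or only one of them
pip install "scduck[pandas]"
pip install "scduck[polars]"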
```

## Sync Logic

See [SYNC_LOGIC.md](SYNC_LOGIC.md) for detailed operation cases.
