Metadata-Version: 2.4
Name: snapparquet
Version: 0.3.0
Summary: Fast Parquet/CSV for pandas: column pruning, PyArrow pushdown, unified multi-file scans, tuple filters, optional parallel I/O, footer-only schema.
Author: snapparquet contributors
License-Expression: MIT
Project-URL: Homepage, https://pypi.org/project/snapparquet/
Keywords: parquet,pyarrow,pandas,analytics,predicate-pushdown,arrow
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas>=1.5.0
Requires-Dist: pyarrow>=12.0.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Dynamic: license-file

# snapparquet

Small, **pandas-first** helpers for **fast Parquet and CSV** workflows: **column pruning**, **tuple-based filters** with **PyArrow dataset predicate pushdown**, optional **legacy integer date/partition ranges**, **unified multi-file scans**, optional **parallel per-file I/O**, and **footer-only schema** utilities for discovery and safe column lists.

**Non-goals:** snapparquet is not a query engine. For SQL over huge Parquet lakes, lazy plans, or TB-scale analytics, use [DuckDB](https://duckdb.org/), [Polars](https://www.pola.rs/), or similar; keep snapparquet for dashboards, ETL scripts, and pipelines that want a thin layer over **pandas + PyArrow**.

---

## Requirements

- **Python** 3.9+
- **pandas** ≥ 1.5
- **PyArrow** ≥ 12

---

## Install

**Current release: 0.3.0**

```bash
pip install "snapparquet==0.3.0"
```

To allow this release or any newer one:

```bash
pip install "snapparquet>=0.3.0"
```

From a source checkout (editable):

```bash
cd snapparquet
pip install -e ".[dev]"
```

---

## Quick start

```python
import snapparquet as sp

cols = ["user_id", "load_date", "revenue"]

# Parquet with column pruning + legacy inclusive integer range on load_date
df = sp.read_tabular(
    "metrics.parquet",
    columns=cols,
    filter_column="load_date",
    range_start=20250301,
    range_end=20250331,
)

# Many daily Parquet files: one PyArrow dataset scan when possible
df = sp.read_tabular_many(
    ["day1.parquet", "day2.parquet", "day3.parquet"],
    columns=cols,
    filter_column="load_date",
    range_start=20250301,
    range_end=20250331,
)

# Tuple filters: AND-combined; pushed to Parquet via pyarrow.dataset when reading Parquet
df = sp.read_parquet(
    "events.parquet",
    columns=["status", "revenue"],
    filters=[("status", "in", ["active", "trial"]), ("revenue", ">", 0)],
)

# Pandas 2+ optional Arrow-backed dtypes (numpy fallback + warning on older pandas)
df = sp.read_parquet("wide.parquet", backend="pyarrow")
```

---

## Reading API

### `read_parquet(path, ...)`

Reads a single Parquet file into a **pandas** `DataFrame`.

| Argument | Description |
| -------- | ----------- |
| `columns` | Optional list of column names to decode (projection / pruning). Filter columns are auto-included when needed. |
| `filter_column`, `range_start`, `range_end` | Legacy **inclusive** integer range on `filter_column`. If that column is **integer-typed** in the Parquet footer and `use_pushdown=True`, the range is merged into the PyArrow dataset filter so row groups may be skipped. |
| `use_pushdown` | If `False`, the legacy range is **not** pushed down to the dataset filter; it is applied in pandas after the read instead. |
| `filters` | Sequence of `(column, operator, value)` tuples (see [Tuple filters](#tuple-filters)). AND-combined; invalid specs raise `ValueError`. |
| `backend` | `"numpy"` (default) or `"pyarrow"`. Invalid values raise `ValueError` immediately (before I/O). |
| `dictionary_as_categorical` | On the numpy path, map Arrow **dictionary-encoded** columns to `pandas.Categorical` to save memory. When `True`, `self_destruct` during Arrow→pandas is disabled so post-processing can still read dictionary buffers safely. |

**Read path (implementation summary):**

1. Try **PyArrow `dataset`** with a Parquet format that enables **row-group pre-buffering** when supported.
2. Call `Dataset.to_table(columns=..., filter=..., use_threads=True)` so **projection**, **predicate pushdown**, and **threaded decode** run together.
3. Convert with `Table.to_pandas` using `split_blocks` / `use_threads`, and optional `types_mapper=pd.ArrowDtype` for `backend="pyarrow"`.
4. If the legacy range **could not** be expressed in the dataset filter (e.g. column not integer in file, or `use_pushdown=False`), apply the same inclusive range **in pandas** after the Arrow read—**even when tuple `filters` were pushed down**—so results match the pandas fallback path.
5. On any failure opening or scanning via dataset, fall back to `pandas.read_parquet` and apply range + filters in pandas.

---

### `read_csv(path, ...)`

CSV → pandas with optional `usecols` (from `columns=`), legacy range, and tuple `filters`. **No** Arrow pushdown (filters and range are applied in pandas).
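
As a rough illustration of what "applied in pandas" means for CSV, the sketch below uses only public pandas calls. It is not the library's code; the sample data and variable names are made up for the example.

```python
import io

import pandas as pd

csv_text = (
    "user_id,load_date,revenue,notes\n"
    "1,20250228,5.0,a\n"
    "2,20250301,0.0,b\n"
    "3,20250315,7.5,c\n"
)

# usecols limits parsing to the requested columns (the role of `columns=`).
df = pd.read_csv(io.StringIO(csv_text), usecols=["user_id", "load_date", "revenue"])

# Legacy inclusive integer range, evaluated in memory because CSV has no pushdown.
in_range = df["load_date"].between(20250301, 20250331)

# A tuple filter such as ("revenue", ">", 0), also evaluated in memory.
positive = df["revenue"] > 0

result = df[in_range & positive].reset_index(drop=True)
```

Every row is parsed before any filter runs, which is why CSV reads cannot skip data the way Parquet row-group pruning can.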

---

### `read_tabular(path, ...)`

Dispatches on the path suffix (`.parquet` / `.csv`, case-insensitive).

- **Parquet:** forwards to `read_parquet` (including `filters`, `backend`, `dictionary_as_categorical`).
- **CSV:** forwards to `read_csv`.

If `columns` is set, the requested names are intersected with what the file actually contains: Parquet uses the footer schema (`intersect_columns`), and CSV reads the header and keeps the overlapping names. If the intersection is empty and `full_read_if_no_column_match=True` (default), all columns are read and the filters are then applied; with `False`, an empty `DataFrame` is returned instead.
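
The dispatch and column-resolution rules can be sketched in a few lines of standard-library Python. The names `pick_format` and `resolve_columns` are hypothetical, and raising `ValueError` for an unknown suffix is an assumption; the README does not say what the library does in that case.

```python
from pathlib import Path


def pick_format(path):
    """Hypothetical: choose a reader from the path suffix, case-insensitively."""
    suffix = Path(path).suffix.lower()
    if suffix == ".parquet":
        return "parquet"
    if suffix == ".csv":
        return "csv"
    # Assumption: unsupported suffixes are rejected.
    raise ValueError(f"unsupported file type: {suffix!r}")


def resolve_columns(wanted, available, full_read_if_no_column_match=True):
    """Hypothetical: decide which columns to read.

    Returns the ordered overlap, None meaning "read all columns",
    or [] meaning "return an empty DataFrame".
    """
    overlap = [name for name in wanted if name in available]
    if overlap:
        return overlap
    return None if full_read_if_no_column_match else []


fmt = pick_format("exports/METRICS.PARQUET")
cols = resolve_columns(["revenue", "user_id", "missing"], {"user_id", "revenue", "load_date"})
```

The overlap keeps the order of `wanted`, not the order of the file, so callers get columns back in the order they asked for.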

---

### `read_tabular_many(paths, ...)` and `read_parquet_many(paths, ...)`

Read many paths and **concatenate** the results row-wise, resetting the index.

| Argument | Description |
| -------- | ----------- |
| `unified_parquet` | Default `True`. If **every** path ends with `.parquet`, try one **`pyarrow.dataset`** over the full list (single scan, shared pushdown across fragments). On failure (schemas, I/O quirks), fall back to per-file reads. |
| `parallel_workers` | Thread count for the per-file fallback. `None` = auto-scaled pool (oversubscribed, since the work is I/O-bound). `0` or `1` = strictly sequential. Any other value is capped at the file count. |
| `ignore_errors` | If `True`, failed files become empty frames instead of raising (use with care). |
| Other kwargs | Same as `read_tabular` / `read_parquet` where applicable (`columns`, `filter_column`, `range_start`, `range_end`, `use_pushdown`, `filters`, `backend`, `dictionary_as_categorical`, `full_read_if_no_column_match`). |

`read_parquet_many` is an alias of `read_tabular_many` for naming clarity in Parquet-only code.
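
To make the `parallel_workers` rules concrete, here is a minimal sketch of a per-file fallback built on `concurrent.futures`. The names `resolve_workers` and `read_many` are hypothetical, and the `4 * cpu_count` auto-scaling heuristic is an assumption; the library's real formula is not documented here.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def resolve_workers(parallel_workers, n_files):
    """Hypothetical: turn the parallel_workers argument into a thread count."""
    if n_files <= 1:
        return 1
    if parallel_workers is None:
        # Assumed heuristic: oversubscribe CPUs for I/O-bound reads, capped by file count.
        return max(1, min(n_files, 4 * (os.cpu_count() or 1)))
    if parallel_workers <= 1:
        # 0 or 1 means strictly sequential.
        return 1
    return min(parallel_workers, n_files)


def read_many(paths, read_one, parallel_workers=None):
    """Hypothetical: apply read_one to each path, in order, possibly in threads."""
    workers = resolve_workers(parallel_workers, len(paths))
    if workers == 1:
        return [read_one(p) for p in paths]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in input order, regardless of completion order.
        return list(pool.map(read_one, paths))


# str.upper stands in for a real per-file reader.
results = read_many(["day1", "day2", "day3"], str.upper, parallel_workers=8)
```

In a real pipeline, `read_one` would return a `DataFrame` and the list would be passed to `pandas.concat(..., ignore_index=True)`.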

---

## Tuple filters

Each filter is a **3-tuple**: `(column_name, operator, value)`.

**Supported operators:** `==`, `!=`, `>`, `>=`, `<`, `<=`, `in`, `not_in`.

- For `in` / `not_in`, `value` must be a **non-string sequence** (e.g. list, tuple); strings are rejected to avoid foot-guns with character iteration.

Filters are **AND**-merged. For Parquet, `build_filter()` produces a `pyarrow.compute` expression consumed by the dataset API.

**Programmatic reuse:** `apply_filters_pandas(df, filters)` applies the same semantics in memory (CSV path, or debugging). It **raises** if a referenced column is missing.

**Column names** are normalized with `norm_col` (strip whitespace). Overly long names or control characters in filter columns raise `ValueError` (defensive guard for untrusted specs).

---

## Schema helpers (footer / metadata only)

These avoid decoding column data where possible.

| Function | Purpose |
| -------- | ------- |
| `schema_column_names(path)` | `frozenset` of column names from Parquet metadata, or `None` if unreadable. |
| `intersect_columns(path, wanted)` | Ordered intersection of `wanted` with file columns; `None` if schema unreadable; `[]` if no overlap. |
| `column_is_integer_type(path, name)` | Whether `name` exists and is an Arrow **integer** type (used to decide legacy range pushdown). |
| `schema_has_all(names, required)` | All required columns present in a name set. |
| `schema_matches_any_group(names, groups)` | True if at least one group’s columns are all present (e.g. alternate mart layouts). |
| `norm_col(name)` | Normalize a column identifier string. |

---

## `backend` and dictionary columns

| Option | Behavior |
| ------ | -------- |
| `backend="numpy"` (default) | Classic numpy-backed pandas dtypes (after Arrow conversion). |
| `backend="pyarrow"` | Uses `types_mapper=pd.ArrowDtype` when **pandas ≥ 2**; otherwise emits `UserWarning` and uses numpy. |
| `dictionary_as_categorical=True` | After conversion, dictionary-encoded Arrow columns are turned into `pandas.Categorical` on the numpy path where possible. |

---

## Performance notes (0.3.0)

- **Arrow-first single-file reads:** Parquet reads prefer `pyarrow.dataset` + threaded `to_table` for both filtered and unfiltered cases, then fall back to pandas only on failure.
- **Pre-buffering:** Parquet fragment scan options use **pre-buffered** row-group reads when the installed PyArrow supports it. This trades some memory for throughput; if the options are unavailable, the read falls back to a scan without them.
- **Unified multi-file Parquet:** One dataset over many files reduces open/scan overhead and improves fragment-level pruning vs naive per-file loops.
- **Parallel fallback:** Default worker count scales with CPU and file count (capped) to hide I/O latency; use `parallel_workers=1` on very slow disks or tiny file counts if needed.
- **Heavier workloads:** For lazy query planning or massive scans, prefer Polars/DuckDB; snapparquet stays small and pandas-centric.

---

## Development and tests

```bash
pip install -e ".[dev]"
pytest tests/ -q                 # full suite
pytest tests/ -q -m "not stress" # skip @pytest.mark.stress cases
```

Stress-marked tests are larger synthetic scans (still seconds-scale, not microbenchmarks).

---

## Changelog (high level)

### 0.3.0

- **Performance / architecture:** Parquet reads use **PyArrow dataset first** for all successful paths (projection + optional filter + `use_threads=True`), with **Parquet pre-buffer** scan options when available.
- **Correctness:** When tuple filters are pushed down but the **legacy integer range** is **not** pushable (e.g. partition column stored as string), the inclusive range is still applied **in pandas** after the Arrow read (previously could be skipped on the dataset path).
- **Parallel I/O:** More aggressive default thread caps for multi-file fallback (still bounded by file count).
- **Small optimizations:** Dictionary index arrays use zero-copy numpy when PyArrow allows.
- **Tests:** Expanded edge-case and stress-style coverage; `stress` pytest marker registered in `pyproject.toml`.

### 0.2.0

- Tuple filters, `backend` / `dictionary_as_categorical`, merged dataset expressions with legacy range when integer pushdown applies.

---

## License

MIT — see repository `LICENSE` / `LICEN[CS]E*` files in the sdist.

---

## Publishing (maintainers)

```bash
pip install build twine
python -m build
twine check dist/*
twine upload dist/snapparquet-0.3.0*
```

Use a [PyPI API token](https://pypi.org/help/#apitoken) (e.g. `TWINE_USERNAME=__token__` and `TWINE_PASSWORD=<token>`). Before uploading, ensure the version stated in `README.md` matches the one in `pyproject.toml` / `snapparquet/_version.py`.
