Metadata-Version: 2.4
Name: snapparquet
Version: 0.1.3
Summary: Small helpers for fast Parquet/CSV: column pruning, optional integer range filters with PyArrow pushdown, footer-only schema.
Author: snapparquet contributors
License-Expression: MIT
Project-URL: Homepage, https://pypi.org/project/snapparquet/
Keywords: parquet,pyarrow,pandas,analytics,predicate-pushdown,arrow
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas>=1.5.0
Requires-Dist: pyarrow>=12.0.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Dynamic: license-file

# snapparquet

## Project description

**snapparquet** provides small, dependency-light helpers for **fast Parquet workflows** in Python: **column pruning** (read only the fields you need), optional **integer-column range filters** with **PyArrow dataset predicate pushdown** when row-group statistics allow, and **footer-only schema** reads for discovery. It fits analytics pipelines and dashboards that aggregate many daily Parquet (or CSV) files.

## Install

Current release: **0.1.3**.

```bash
pip install "snapparquet==0.1.3"
```

To accept 0.1.3 or any later release (`>=` sets only a lower bound and does not restrict upgrades to compatible versions):

```bash
pip install "snapparquet>=0.1.3"
```

From a checkout:

```bash
pip install -e .
```

## Quick start

```python
import snapparquet as sp

cols = ["user_id", "load_date", "revenue"]

df = sp.read_tabular(
    "metrics.parquet",
    columns=cols,
    filter_column="load_date",
    range_start=20250301,
    range_end=20250331,
)

# Many daily files
df = sp.read_tabular_many(
    ["day1.parquet", "day2.parquet"],
    columns=cols,
    filter_column="load_date",
    range_start=20250301,
    range_end=20250331,
)
```

## API

| Function | Purpose |
| -------- | ------- |
| `read_parquet(path, columns=..., filter_column=..., range_start=..., range_end=...)` | Parquet → pandas with pruning and optional range filter |
| `read_csv(...)` | CSV → pandas with `usecols` pruning and the same range filter, applied in memory |
| `read_tabular(path, ...)` | Chooses Parquet vs CSV from the file suffix |
| `read_tabular_many(paths, unified_parquet=True, parallel_workers=None, ...)` | Concatenates multiple files: one PyArrow dataset scan when all paths are Parquet, otherwise a per-file fallback that can run in parallel |
| `read_parquet_many(paths, ...)` | Same as `read_tabular_many` (alias for Parquet-heavy workflows) |
| `schema_column_names(path)` | Column set from Parquet metadata only |
| `intersect_columns(path, wanted)` | Ordered intersection for `columns=` |
| `column_is_integer_type(path, name)` | Whether range pushdown may apply |
| `schema_matches_any_group(names, groups)` | Schema shape checks (e.g. alternate mart layouts) |

Also exported: `norm_col`, `schema_has_all`.
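
As an illustration of what an "ordered intersection" for `columns=` can look like, here is a minimal pure-Python sketch. It is not the library's implementation: `ordered_intersection` is a hypothetical name, it takes the schema as a plain list instead of reading it from a path, and it assumes the result follows the order of `wanted` (the order `intersect_columns` actually uses is not stated here).

```python
from typing import Iterable, List


def ordered_intersection(available: Iterable[str], wanted: Iterable[str]) -> List[str]:
    """Return the entries of `wanted` that exist in `available`, in `wanted` order.

    Hypothetical helper for illustration; duplicates in `wanted` are dropped.
    """
    available_set = set(available)
    seen = set()
    result = []
    for name in wanted:
        if name in available_set and name not in seen:
            seen.add(name)
            result.append(name)
    return result


# Column names as they might come back from a schema lookup (made-up values).
schema = ["load_date", "user_id", "revenue", "country"]
wanted = ["user_id", "load_date", "revenue", "discount"]

# "discount" is not in the schema, so it is left out of the pruned column list.
print(ordered_intersection(schema, wanted))  # ['user_id', 'load_date', 'revenue']
```

Passing only columns that exist avoids a read error when a file lacks one of the requested fields.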



If `filter_column` is stored as an **integer** type in the Parquet file, `read_parquet` / `read_tabular` try `pyarrow.dataset` first so row groups can be skipped using min/max statistics. Otherwise the same range filter runs in pandas after a (possibly pruned) read.
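
To make the idea of skipping row groups concrete, the following pure-Python sketch mimics statistics-based pruning. It does not use PyArrow and is not how snapparquet is implemented: `RowGroup` and `scan_with_pushdown` are hypothetical names, and the range is assumed to be inclusive on both ends.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RowGroup:
    """Hypothetical stand-in for a Parquet row group and its min/max statistics."""
    min_value: int
    max_value: int
    values: List[int]


def scan_with_pushdown(groups: List[RowGroup], start: int, end: int) -> Tuple[List[int], int]:
    """Return the matching values and the number of row groups skipped.

    Assumes an inclusive [start, end] range.
    """
    matched: List[int] = []
    skipped = 0
    for group in groups:
        # The statistics prove that no row in this group can match,
        # so the group is skipped without looking at its values.
        if group.max_value < start or group.min_value > end:
            skipped += 1
            continue
        # Overlapping groups still need a row-level filter.
        matched.extend(v for v in group.values if start <= v <= end)
    return matched, skipped


groups = [
    RowGroup(20250201, 20250228, [20250201, 20250215, 20250228]),
    RowGroup(20250225, 20250305, [20250225, 20250301, 20250305]),
    RowGroup(20250401, 20250430, [20250401, 20250430]),
]
rows, skipped = scan_with_pushdown(groups, 20250301, 20250331)
print(rows, skipped)  # [20250301, 20250305] 2
```

The benefit depends on how the data is laid out: when files are written in `load_date` order, row-group ranges barely overlap and most groups can be skipped.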

## Performance (0.1.3+)

- **Multi-file Parquet**: `read_tabular_many` first tries a **single** `pyarrow.dataset` over all paths (fewer opens, better fragment pruning). If that fails (mixed schemas, odd layouts), it falls back to **per-file** reads.
- **Parallel I/O**: By default (`parallel_workers=None`), the fallback uses a **thread pool** when there are several paths. Set `parallel_workers=1` for strictly sequential reads, for example on spinning disks or with very few files.
- **Arrow → pandas**: Reads use threaded conversion where PyArrow supports it (`split_blocks` / `use_threads`).
- **Heavier analytics**: For SQL over many files, columnar scans at TB scale, or lazy query plans, consider **[Polars](https://www.pola.rs/)** or **[DuckDB](https://duckdb.org/)** on top of the same Parquet files; **snapparquet** stays small and pandas-first for dashboards and ETL scripts.
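
The unified-scan-then-fallback strategy described in the list above can be sketched roughly as follows. This is an illustration under stated assumptions, not the library's code: `read_many`, `read_unified`, and `read_one` are hypothetical names, rows are plain dicts instead of DataFrames, and catching every `Exception` is a simplification.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional, Sequence

Rows = List[dict]


def read_many(
    paths: Sequence[str],
    read_unified: Callable[[Sequence[str]], Rows],
    read_one: Callable[[str], Rows],
    parallel_workers: Optional[int] = None,
) -> Rows:
    """Try one scan over all paths; if it fails, read file by file."""
    try:
        return read_unified(paths)
    except Exception:
        pass  # e.g. mixed schemas: fall through to per-file reads

    if parallel_workers == 1 or len(paths) < 2:
        # Strictly sequential reads.
        chunks = [read_one(p) for p in paths]
    else:
        # max_workers=None lets the executor pick its default pool size.
        with ThreadPoolExecutor(max_workers=parallel_workers) as pool:
            chunks = list(pool.map(read_one, paths))  # map keeps input order

    # Concatenate the per-file results in path order.
    return [row for chunk in chunks for row in chunk]


def failing_unified(paths: Sequence[str]) -> Rows:
    """Simulates a unified scan that cannot handle the given files."""
    raise ValueError("mixed schemas")


def fake_read_one(path: str) -> Rows:
    """Simulates reading one file; returns a single row naming its source."""
    return [{"source": path}]


rows = read_many(["day1.parquet", "day2.parquet"], failing_unified, fake_read_one)
print([r["source"] for r in rows])  # ['day1.parquet', 'day2.parquet']
```

Threads suit this fallback because file reads are mostly I/O-bound, and `pool.map` returns results in input order, so the concatenated output does not depend on which file finishes first.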
