Metadata-Version: 2.4
Name: snapparquet
Version: 0.1.1
Summary: Fast column-pruned Parquet reads with optional predicate pushdown for analytics.
Author: snapparquet contributors
License-Expression: MIT
Project-URL: Homepage, https://pypi.org/project/snapparquet/
Keywords: parquet,pyarrow,pandas,analytics,predicate-pushdown,arrow
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas>=1.5.0
Requires-Dist: pyarrow>=12.0.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Dynamic: license-file

# snapparquet

Small, dependency-light helpers for **fast Parquet workflows** in Python: **column pruning** (read only the fields you need), optional **integer-column range filters** with **PyArrow dataset predicate pushdown** when row-group statistics allow, and **footer-only schema** reads for discovery.

## Install

```bash
pip install snapparquet
```

From a checkout:

```bash
pip install -e .
```

## Quick start

```python
import snapparquet as sp

cols = ["user_id", "load_date", "revenue"]

df = sp.read_tabular(
    "metrics.parquet",
    columns=cols,
    filter_column="load_date",
    range_start=20250301,
    range_end=20250331,
)

# Many daily files
df = sp.read_tabular_many(
    ["day1.parquet", "day2.parquet"],
    columns=cols,
    filter_column="load_date",
    range_start=20250301,
    range_end=20250331,
)
```

## API

| Function | Purpose |
|----------|---------|
| `read_parquet(path, columns=..., filter_column=..., range_start=..., range_end=...)` | Parquet → pandas with pruning and optional range filter |
| `read_csv(...)` | CSV with `usecols` + same range filter (in memory) |
| `read_tabular(path, ...)` | Chooses Parquet vs CSV from the file suffix |
| `read_tabular_many(paths, ...)` | Concatenate multiple files |
| `schema_column_names(path)` | Column set from Parquet metadata only |
| `intersect_columns(path, wanted)` | Ordered intersection for `columns=` |
| `column_is_integer_type(path, name)` | Whether range pushdown may apply |
| `schema_matches_any_group(names, groups)` | Schema shape checks (e.g. alternate mart layouts) |

## When pushdown applies

If `filter_column` is stored as an **integer** type in the Parquet file, `read_parquet` / `read_tabular` try `pyarrow.dataset` first so row groups can be skipped using min/max statistics. Otherwise the same range filter runs in pandas after a (possibly pruned) read.

## Publishing to PyPI

Follow the [PyPA packaging tutorial](https://packaging.python.org/en/latest/tutorials/packaging-projects/): use **TestPyPI** first, then production PyPI.

1. Register at [pypi.org](https://pypi.org/account/register/) and create an **API token** under [Account settings → API tokens](https://pypi.org/manage/account/).
2. Check whether the project name is already taken: `pip index versions snapparquet`. An error saying no matching distribution was found suggests the name is unused.
3. From this directory:

```bash
python3 -m pip install --upgrade pip build twine
python3 -m build
python3 -m twine check dist/*
```

4. **TestPyPI** (recommended first upload; it uses its own account and API token, separate from production PyPI):

```bash
python3 -m twine upload --repository testpypi dist/*
```

5. **Production PyPI** (omit `--repository`; default is [pypi.org](https://pypi.org/)):

```bash
python3 -m twine upload dist/*
```

When prompted, enter `__token__` as the username and your API token (including the `pypi-` prefix) as the password. For non-interactive uploads, set `TWINE_USERNAME=__token__` and put the token in `TWINE_PASSWORD`.

After publishing, set the `Repository` and `Issues` entries under `[project.urls]` in `pyproject.toml` to your real Git host. The updated links appear on PyPI with the next release you upload.

## License

MIT
