Metadata-Version: 2.4
Name: theoremkb
Version: 0.2.0
Summary: Unified dataset loaders for e-SNLI, QASC, and WorldTree.
Author: Xinquan
License-Expression: MIT
Project-URL: Homepage, https://github.com/your-name/theoremkb
Project-URL: Repository, https://github.com/your-name/theoremkb
Keywords: datasets,nlp,reasoning,qa,nli
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: datasets<3.0.0,>=2.19.0
Provides-Extra: pandas
Requires-Dist: pandas>=2.0.0; extra == "pandas"
Provides-Extra: dev
Requires-Dist: build>=1.2.2; extra == "dev"
Requires-Dist: pytest>=8.0.0; extra == "dev"
Requires-Dist: twine>=5.1.1; extra == "dev"
Dynamic: license-file

# theoremkb

`theoremkb` is a Python package with a unified API for loading:

- e-SNLI
- QASC
- WorldTree

The package uses Hugging Face `datasets` as the backend and keeps loader parameters consistent across all datasets.

## Installation

```bash
pip install theoremkb
```

For pandas output:

```bash
pip install "theoremkb[pandas]"
```

## Quick Start

```python
from theoremkb import load_qasc

records = load_qasc(
    split="train",
    as_format="records",
    max_samples=100,
    shuffle=True,
    seed=7,
)
print(records[0])
```

Generic loader:

```python
from theoremkb import load

dataset = load("worldtree", split="train")
```

Terminal demo (prints the 30th e-SNLI sample):

```bash
PYTHONPATH=src python examples/esnli_thirtieth_record.py
```

Notebook demo:

- `notebooks/theoremkb_esnli_demo.ipynb`

## Unified Parameters

All dataset-specific loaders have the same signature:

- `split`: dataset split (`train`, `validation`, `test`; `dev` and `val` are mapped to `validation`)
- `subset`: optional HF config name
- `cache_dir`: custom cache directory
- `revision`: dataset revision/commit
- `token`: HF auth token
- `force_download`: force a re-download (`True` maps to `force_redownload`)
- `streaming`: use streaming mode
- `trust_remote_code`: required for some datasets (currently `esnli`)
- `shuffle`: shuffle loaded samples
- `seed`: random seed for shuffle
- `max_samples`: cap the number of samples returned (required when `streaming=True` and `as_format` is `records` or `pandas`)
- `as_format`: one of `datasets`, `records`, `pandas`
- `drop_empty_fields`: when `True` (default), remove fields whose value is `""` or `None` in `records/pandas` output
- `validate_split`: query the available splits and validate the split name before loading
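
The split aliasing and `drop_empty_fields` behavior described above can be illustrated with a minimal sketch. The helper names `normalize_split` and `drop_empty` are hypothetical and are not part of the `theoremkb` API; the case-insensitive matching is also an assumption, since the list only states that `dev` and `val` map to `validation`.

```python
from typing import Any

# Hypothetical alias table: the parameter list only states that
# dev/val are mapped to validation.
_SPLIT_ALIASES = {"dev": "validation", "val": "validation"}


def normalize_split(split: str) -> str:
    """Map common aliases to the canonical split name.

    Lowercasing and stripping whitespace are assumptions of this sketch.
    """
    key = split.strip().lower()
    return _SPLIT_ALIASES.get(key, key)


def drop_empty(record: dict[str, Any]) -> dict[str, Any]:
    """Remove fields whose value is "" or None; keep 0, False, and []."""
    return {k: v for k, v in record.items() if v is not None and v != ""}


if __name__ == "__main__":
    print(normalize_split("dev"))
    print(drop_empty({"question": "What is ice?", "hint": "", "fact": None}))
```

Only `""` and `None` are treated as empty here, so falsy but meaningful values such as `0` or `False` survive the filter.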

## Included Dataset Mapping

- `esnli` -> `esnli/esnli`
- `qasc` -> `allenai/qasc`
- `worldtree` -> `nguyen-brat/worldtree`

Use `list_datasets()` to list canonical names.
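
As a rough illustration of how a canonical name could resolve to a Hugging Face repository ID, the sketch below hard-codes the mapping listed above. `DATASET_IDS`, `canonical_names`, and `resolve_repo_id` are hypothetical names for this example; the package's internal implementation may differ.

```python
# Mapping copied from the list above; the names below are illustrative only.
DATASET_IDS = {
    "esnli": "esnli/esnli",
    "qasc": "allenai/qasc",
    "worldtree": "nguyen-brat/worldtree",
}


def canonical_names() -> list[str]:
    """Return the canonical dataset names in sorted order."""
    return sorted(DATASET_IDS)


def resolve_repo_id(name: str) -> str:
    """Look up the repository ID for a canonical name.

    Case-insensitive matching is an assumption of this sketch.
    """
    key = name.strip().lower()
    try:
        return DATASET_IDS[key]
    except KeyError:
        raise ValueError(
            f"unknown dataset {name!r}; expected one of {canonical_names()}"
        ) from None


if __name__ == "__main__":
    print(resolve_repo_id("qasc"))
```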

## Publishing

```bash
python -m pip install --upgrade build twine
python -m build
python -m twine check dist/*
python -m twine upload dist/*
```

## Notes

- `load_esnli(...)` requires `trust_remote_code=True` in the current Hugging Face setup.
- If you use `streaming=True` with `as_format="records"` or `as_format="pandas"`, set `max_samples` to avoid unbounded materialization.
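
To see why `max_samples` matters for streaming, consider the sketch below. It uses only the standard library: `endless_stream` stands in for a streaming dataset, and `take_records` is a hypothetical helper, not a `theoremkb` function. Without the cap, converting the stream to a list would never finish.

```python
from itertools import count, islice
from typing import Any, Iterable, Iterator


def take_records(stream: Iterable[dict[str, Any]], max_samples: int) -> list[dict[str, Any]]:
    """Materialize at most max_samples items from a possibly unbounded stream."""
    if max_samples < 0:
        raise ValueError("max_samples must be non-negative")
    return list(islice(stream, max_samples))


def endless_stream() -> Iterator[dict[str, Any]]:
    """Stand-in for a streaming dataset that never ends."""
    for i in count():
        yield {"id": i}


if __name__ == "__main__":
    print(take_records(endless_stream(), max_samples=3))
    # prints [{'id': 0}, {'id': 1}, {'id': 2}]
```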

## Troubleshooting

If you see an error like `Dataset scripts are no longer supported, but found ...`,
your environment likely has a `datasets` version newer than the `<3.0.0` range this package requires.

Run:

```bash
python -m pip install "datasets<3"
```
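
Before reinstalling, it can help to confirm which `datasets` version is actually installed. The following is a minimal sketch that compares the installed version against the `>=2.19.0,<3.0.0` range declared in this package's metadata. The helper names are hypothetical, and the parser ignores pre-release suffixes such as `.dev0` or `rc1`, so treat the result as a quick check rather than a full version-specifier evaluation.

```python
import re
from importlib import metadata
from typing import Optional


def parse_release(version: str) -> tuple:
    """Return the leading numeric release padded to three parts, e.g. '2.19' -> (2, 19, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    parts = tuple(int(part) for part in match.group().split("."))
    return (parts + (0, 0, 0))[:3]


def is_supported(version: str) -> bool:
    """True when the release falls inside >=2.19.0,<3.0.0 (suffixes ignored)."""
    return (2, 19, 0) <= parse_release(version) < (3, 0, 0)


def installed_datasets_version() -> Optional[str]:
    """Return the installed datasets version, or None if it is not installed."""
    try:
        return metadata.version("datasets")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_datasets_version()
    if found is None:
        print("datasets is not installed")
    else:
        print(found, "supported" if is_supported(found) else "outside supported range")
```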

If you see TLS/SSL errors like `UNEXPECTED_EOF_WHILE_READING`, your machine cannot establish a secure connection to Hugging Face.

Check:

```bash
python - <<'PY'
import requests
print(requests.get("https://huggingface.co", timeout=15).status_code)
PY
```

If needed, configure a proxy or CA certificates (`HTTPS_PROXY`, `HTTP_PROXY`, `REQUESTS_CA_BUNDLE`).
For temporary debugging only, you can disable SSL verification:

```bash
export HF_HUB_DISABLE_SSL_VERIFICATION=1
```
