Metadata-Version: 2.4
Name: querygym
Version: 0.1.2
Summary: LLM-based Query Reformulation Toolkit
Author-email: Radin Hamidi <radin.h@gmail.com>, Amin Bigdeli <aminbigdeli97@gmail.com>, Mert Incesu <mert.incesu03@gmail.com>, Negar Arabzadeh <ngr.arabzadeh@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/radinhamidi/QueryGym
Project-URL: Documentation, https://github.com/radinhamidi/QueryGym/blob/main/README.md
Project-URL: Repository, https://github.com/radinhamidi/QueryGym
Project-URL: Issues, https://github.com/radinhamidi/QueryGym/issues
Keywords: information-retrieval,query-reformulation,llm,nlp
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: typer>=0.12.3
Requires-Dist: pyyaml>=6.0.1
Requires-Dist: openai>=1.40.0
Requires-Dist: tqdm>=4.65.0
Provides-Extra: hf
Requires-Dist: datasets>=2.20.0; extra == "hf"
Provides-Extra: beir
Requires-Dist: beir>=2.0.0; extra == "beir"
Provides-Extra: pyserini
Requires-Dist: pyserini>=0.22.0; extra == "pyserini"
Provides-Extra: dev
Requires-Dist: pytest>=7.4.0; extra == "dev"
Requires-Dist: pytest-cov>=4.1.0; extra == "dev"
Requires-Dist: pytest-mock>=3.11.0; extra == "dev"
Requires-Dist: black>=23.0.0; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"
Requires-Dist: mypy>=1.5.0; extra == "dev"
Requires-Dist: mkdocs>=1.5.0; extra == "dev"
Requires-Dist: mkdocs-material>=9.0.0; extra == "dev"
Requires-Dist: mkdocstrings[python]>=0.23.0; extra == "dev"
Provides-Extra: all
Requires-Dist: datasets>=2.20.0; extra == "all"
Requires-Dist: beir>=2.0.0; extra == "all"
Requires-Dist: pyserini>=0.22.0; extra == "all"
Dynamic: license-file

[![Publish to PyPI](https://github.com/radinhamidi/QueryGym/actions/workflows/publish.yml/badge.svg)](https://github.com/radinhamidi/QueryGym/actions/workflows/publish.yml)

# querygym

A lightweight, reproducible toolkit for **LLM-based query reformulation**.

- Single **Prompt Bank** (YAML) with metadata.
- **Simple DataLoader**: Dependency-free file loading for queries, qrels, and contexts.
- **Format Loaders**: Optional BEIR and MS MARCO format loaders in `querygym.loaders`.
- **OpenAI-compatible** LLM client (works with any OpenAI API–compatible endpoint).
- **Pyserini is optional**: pass precomputed contexts (JSONL), or pass a retriever instance to build them.
- **Export-only**: emits reformulated queries and can optionally generate a **bash** script for Pyserini + `trec_eval`.

## Quickstart

### Python API (Recommended)
```python
import querygym as qg

# Load data
queries = qg.load_queries("queries.tsv")
qrels = qg.load_qrels("qrels.txt")
contexts = qg.load_contexts("contexts.jsonl")

# Create reformulator
reformulator = qg.create_reformulator("genqr_ensemble", model="gpt-4")

# Reformulate
results = reformulator.reformulate_batch(queries)

# Save
qg.DataLoader.save_queries(
    [qg.QueryItem(r.qid, r.reformulated) for r in results],
    "reformulated.tsv"
)
```
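
The quickstart above expects `queries.tsv` to exist already. As a minimal sketch, a small test file can be written with the standard library alone. It assumes a two-column `qid<TAB>query text` layout with no header row; that layout is an assumption, so check the `DataLoader` documentation for the exact format querygym expects. The helper name `write_queries_tsv` and the sample queries are hypothetical.

```python
import csv
from pathlib import Path


def write_queries_tsv(path, queries):
    """Write (qid, text) pairs as tab-separated lines, one query per line."""
    # Assumed layout: qid<TAB>query text, no header row.
    with Path(path).open("w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        for qid, text in queries:
            writer.writerow([qid, text])


# Hypothetical sample queries, for illustration only.
sample_queries = [
    ("q1", "what causes aurora borealis"),
    ("q2", "treatment for iron deficiency"),
]
write_queries_tsv("queries.tsv", sample_queries)
```

The resulting file can then be passed to `qg.load_queries("queries.tsv")` as in the quickstart, provided the assumed layout matches.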

### CLI
```bash
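# Run from a clone of the repository; the quotes keep shells such as zsh from expanding the brackets
pip install -e ".[hf,beir,dev]"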
export OPENAI_API_KEY=sk-...

# Run a method (e.g., genqr_ensemble)
querygym run --method genqr_ensemble \
  --queries-tsv queries.tsv \
  --output-tsv reformulated.tsv \
  --cfg-path querygym/config/defaults.yaml
```

### Loading Datasets

**BEIR:**
```python
import querygym as qg

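# Download with the BEIR library (requires the `beir` extra)
from beir import util
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/nfcorpus.zip"
data_path = util.download_and_unzip(url, "datasets")  # returns the unzipped dataset folder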

# Load with querygym
queries = qg.loaders.beir.load_queries(data_path)
qrels = qg.loaders.beir.load_qrels(data_path)
```

**MS MARCO:**
```python
import querygym as qg

# Load from local files (e.g., downloaded with ir_datasets)
queries = qg.loaders.msmarco.load_queries("queries.tsv")
qrels = qg.loaders.msmarco.load_qrels("qrels.tsv")
```
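
For orientation, qrels files in the TREC format carry four whitespace-separated columns per line: query id, an unused iteration field, document id, and relevance label. The sketch below parses that layout with the standard library. It is not querygym's loader, `parse_trec_qrels` is a hypothetical helper name, and the sample lines are made up; whether `qg.loaders.msmarco.load_qrels` returns the same structure is not stated here.

```python
from collections import defaultdict


def parse_trec_qrels(lines):
    """Parse TREC-style qrels lines into {qid: {docid: relevance}}."""
    qrels = defaultdict(dict)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            # Skip blank or malformed lines rather than failing.
            continue
        qid, _iteration, docid, rel = parts
        qrels[qid][docid] = int(rel)
    return dict(qrels)


# Made-up sample lines, for illustration only.
sample = ["1 0 D100 1", "1 0 D200 0", "2 0 D300 2"]
qrels = parse_trec_qrels(sample)
```

A nested dictionary keyed by query id makes it cheap to look up the judged documents for a single query when inspecting reformulation results.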

See [example scripts](scripts/README.md) for complete workflows.
