Metadata-Version: 2.4
Name: themis-eval
Version: 3.0.0
Summary: Lightweight evaluation platform for LLM experiments
Author: Pittawat Taveekitworachai
License-Expression: MIT
Project-URL: Homepage, https://pittawat2542.github.io/themis/
Project-URL: Repository, https://github.com/Pittawat2542/themis
Project-URL: Documentation, https://pittawat2542.github.io/themis/
Project-URL: Changelog, https://github.com/Pittawat2542/themis/blob/main/CHANGELOG.md
Project-URL: Issues, https://github.com/Pittawat2542/themis/issues
Keywords: llm,evaluation,benchmark,experiments,ai
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.12
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pydantic>=2.12.5
Requires-Dist: rich>=13.9.0
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == "dev"
Requires-Dist: pytest-cov>=6.0.0; extra == "dev"
Requires-Dist: pytest-timeout>=2.3.1; extra == "dev"
Requires-Dist: pytest-asyncio>=0.24.0; extra == "dev"
Requires-Dist: ruff>=0.8.5; extra == "dev"
Requires-Dist: mypy>=1.14.0; extra == "dev"
Provides-Extra: compression
Requires-Dist: zstandard>=0.25.0; extra == "compression"
Provides-Extra: datasets
Requires-Dist: datasets>=2.20.0; extra == "datasets"
Provides-Extra: extractors
Requires-Dist: jsonschema>=4.0.0; extra == "extractors"
Provides-Extra: providers-openai
Requires-Dist: openai>=1.0.0; extra == "providers-openai"
Provides-Extra: providers-litellm
Requires-Dist: litellm>=1.81.0; extra == "providers-litellm"
Requires-Dist: tenacity>=9.1.2; extra == "providers-litellm"
Provides-Extra: providers-vllm
Requires-Dist: vllm>=0.17.0; sys_platform == "linux" and extra == "providers-vllm"
Provides-Extra: docs
Requires-Dist: mkdocs>=1.6.0; extra == "docs"
Requires-Dist: mkdocs-material>=9.5.0; extra == "docs"
Requires-Dist: mkdocstrings[python]>=0.25.0; extra == "docs"
Provides-Extra: stats
Requires-Dist: numpy>=1.24.0; extra == "stats"
Requires-Dist: scipy>=1.10.0; extra == "stats"
Requires-Dist: pandas>=2.0.0; extra == "stats"
Provides-Extra: telemetry
Requires-Dist: langfuse>=3.0.0; extra == "telemetry"
Requires-Dist: wandb>=0.19.0; extra == "telemetry"
Provides-Extra: storage-postgres
Requires-Dist: psycopg[binary]>=3.2.0; extra == "storage-postgres"
Provides-Extra: all
Requires-Dist: themis-eval[compression,datasets,docs,extractors,providers-litellm,providers-openai,providers-vllm,stats,storage-postgres,telemetry]; extra == "all"
Dynamic: license-file

# Themis

> Benchmark-first orchestration for reproducible LLM evaluation.

[![CI](https://github.com/Pittawat2542/themis/actions/workflows/ci.yml/badge.svg)](https://github.com/Pittawat2542/themis/actions/workflows/ci.yml)
[![Docs](https://github.com/Pittawat2542/themis/actions/workflows/docs.yml/badge.svg)](https://github.com/Pittawat2542/themis/actions/workflows/docs.yml)
[![PyPI version](https://img.shields.io/pypi/v/themis-eval.svg)](https://pypi.org/project/themis-eval/)
[![Python 3.12+](https://img.shields.io/badge/python-3.12+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/license-MIT-yellow.svg)](LICENSE)

Themis now documents and supports a single public authoring flow, built from the following components:

- `ProjectSpec` for shared storage and execution policy
- `BenchmarkSpec` for benchmark slices, prompt variants, parse pipelines, and scores
- `PluginRegistry` for engines, parsers, metrics, judges, and hooks
- `Orchestrator` for planning, execution, handoffs, and imports
- `BenchmarkResult` for aggregation, paired comparison, artifact bundles, and timelines
- `generate_config_report(...)` for reproducibility snapshots
- `themis-quickcheck` for fast SQLite inspection by slice and benchmark dimension

## Why Themis

- Benchmark-native authoring instead of experiment-matrix bookkeeping
- Query-aware dataset providers for subset, filter, and pushdown sampling
- Explicit prompt variants and parse pipelines instead of payload hacks
- Projection-backed results with `slice_id`, `prompt_variant_id`, and semantic dimensions
- Local-first storage and deterministic reuse of completed work

## Installation

```bash
uv add themis-eval
```

Add extras only when needed, for example `uv add "themis-eval[stats]"`:

- `stats` for paired comparisons and richer report tooling
- `compression` for compressed artifact storage
- `extractors` for additional built-in parsing helpers
- `datasets` for dataset integrations
- `providers-openai`, `providers-litellm`, `providers-vllm` for provider SDKs (the `vllm` dependency is installed only on Linux)
- `telemetry` for external observability callbacks
- `storage-postgres` for Postgres-backed storage
- `all` for every extra listed here, plus the `docs` toolchain
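
To see which extras are already usable in an environment, a small standalone check along these lines may help. It is a sketch, not part of Themis: the `EXTRA_MODULES` mapping and both helper functions are hypothetical names, and the module names are inferred from the dependencies each extra declares in the package metadata (provider extras are left out for brevity).

```python
from importlib.util import find_spec

# Extra name -> top-level modules its declared dependencies provide.
# Assumption: each dependency is importable under the name shown here.
EXTRA_MODULES = {
    "stats": ["numpy", "scipy", "pandas"],
    "compression": ["zstandard"],
    "extractors": ["jsonschema"],
    "datasets": ["datasets"],
    "telemetry": ["langfuse", "wandb"],
    "storage-postgres": ["psycopg"],
}


def missing_modules(extra: str) -> list[str]:
    """Return the modules for `extra` that cannot be found in this environment."""
    return [name for name in EXTRA_MODULES[extra] if find_spec(name) is None]


def available_extras() -> list[str]:
    """Return the extras whose modules are all importable."""
    return [extra for extra in EXTRA_MODULES if not missing_modules(extra)]


if __name__ == "__main__":
    # Print one status line per extra.
    for extra in EXTRA_MODULES:
        missing = missing_modules(extra)
        status = "ok" if not missing else f"missing: {', '.join(missing)}"
        print(f"{extra:<18}{status}")
```

`find_spec` only locates a module without importing it, so the check stays cheap even for heavy packages.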

## Quick Start

Run the shipped hello-world benchmark:

```bash
uv run python examples/01_hello_world.py
```

Expected output:

```text
{'model_id': 'demo-model', 'slice_id': 'arithmetic', 'metric_id': 'exact_match', 'source': 'synthetic', 'prompt_variant_id': 'qa-default', 'mean': 1.0, 'count': 1}
```

That script shows the full benchmark-first loop:

- define a `DatasetProvider.scan(slice_spec, query)`
- register one engine and one metric
- build a `BenchmarkSpec`
- run `orchestrator.run_benchmark(...)`
- inspect the returned `BenchmarkResult`
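
The shape of that loop can be sketched without Themis at all. The code below is a toy illustration only: it does not import or reproduce the Themis API, and every class, function, and parameter in it (`ToyDatasetProvider`, `demo_engine`, `exact_match`, `run_benchmark`, the `query` dict with a `limit` key) is a hypothetical stand-in. Refer to `examples/01_hello_world.py` for the real calls.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Item:
    """One dataset row: a prompt and its reference answer."""

    prompt: str
    expected: str


class ToyDatasetProvider:
    """Hypothetical stand-in for a dataset provider; not the Themis class."""

    def scan(self, slice_spec: str, query: dict) -> list[Item]:
        # A real provider would use the query to subset or filter at the source.
        items = [Item("What is 2 + 2?", "4")]
        return items[: query.get("limit")]


def demo_engine(prompt: str) -> str:
    """Stand-in engine that always answers '4'."""
    return "4"


def exact_match(output: str, expected: str) -> float:
    """Score 1.0 when the trimmed strings are equal, else 0.0."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def run_benchmark(provider, engine, metric, *, model_id, slice_id, metric_id):
    """Generate, score, and aggregate one slice into a single result row."""
    scores = [
        metric(engine(item.prompt), item.expected)
        for item in provider.scan(slice_id, {})
    ]
    return {
        "model_id": model_id,
        "slice_id": slice_id,
        "metric_id": metric_id,
        "mean": mean(scores),
        "count": len(scores),
    }


row = run_benchmark(
    ToyDatasetProvider(),
    demo_engine,
    exact_match,
    model_id="demo-model",
    slice_id="arithmetic",
    metric_id="exact_match",
)
print(row)
```

The toy row carries only a subset of the fields in the expected output; fields such as `source` and `prompt_variant_id` come from the real benchmark definition and are not modelled here.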

The complete script is embedded in [docs/quick-start/index.md](docs/quick-start/index.md).

## Examples

Runnable examples live in [`examples/`](examples/):

- `01_hello_world.py`
- `02_project_file.py`
- `03_custom_extractor_metric.py`
- `04_compare_models.py`
- `05_resume_run.py`
- `06_hooks_and_timeline.py`
- `07_judge_metric.py`
- `08_external_stage_handoff.py`
- `09_experiment_evolution.py`

`examples/medical_reasoning_eval` is intentionally left untouched as a handoff
reference. It is not the recommended public authoring pattern after the
benchmark-first redesign.

## Documentation

- Docs site: https://pittawat2542.github.io/themis/
- Quick Start: [docs/quick-start/index.md](docs/quick-start/index.md)
- Tutorials: [docs/tutorials/index.md](docs/tutorials/index.md)
- Concepts: [docs/concepts/index.md](docs/concepts/index.md)
- Guides: [docs/guides/index.md](docs/guides/index.md)
- API Reference: [docs/api-reference/index.md](docs/api-reference/index.md)
- FAQ: [docs/faq/index.md](docs/faq/index.md)

## Development

```bash
uv sync --all-extras --dev
uv run pytest
uv run mkdocs build --strict
uv run ruff check
```

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md).

## Citation

If you use Themis in research, cite via [`CITATION.cff`](CITATION.cff).

## License

MIT. See [LICENSE](LICENSE).
