Metadata-Version: 2.4
Name: themis-eval
Version: 3.1.0
Summary: Lightweight evaluation platform for LLM experiments
Author: Pittawat Taveekitworachai
License-Expression: MIT
Project-URL: Homepage, https://pittawat2542.github.io/themis/
Project-URL: Repository, https://github.com/Pittawat2542/themis
Project-URL: Documentation, https://pittawat2542.github.io/themis/
Project-URL: Changelog, https://github.com/Pittawat2542/themis/blob/main/CHANGELOG.md
Project-URL: Issues, https://github.com/Pittawat2542/themis/issues
Keywords: llm,evaluation,benchmark,experiments,ai
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.12
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: cyclopts<5,>=4.8
Requires-Dist: pydantic>=2.12.5
Requires-Dist: rich>=13.9.0
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == "dev"
Requires-Dist: pytest-cov>=6.0.0; extra == "dev"
Requires-Dist: pytest-timeout>=2.3.1; extra == "dev"
Requires-Dist: pytest-asyncio>=0.24.0; extra == "dev"
Requires-Dist: ruff>=0.8.5; extra == "dev"
Requires-Dist: mypy>=1.14.0; extra == "dev"
Provides-Extra: compression
Requires-Dist: zstandard>=0.25.0; extra == "compression"
Provides-Extra: datasets
Requires-Dist: datasets>=2.20.0; extra == "datasets"
Provides-Extra: extractors
Requires-Dist: jsonschema>=4.0.0; extra == "extractors"
Provides-Extra: math
Requires-Dist: math-verify[antlr4_13_2]>=0.7.0; extra == "math"
Provides-Extra: providers-openai
Requires-Dist: openai>=1.0.0; extra == "providers-openai"
Provides-Extra: providers-litellm
Requires-Dist: litellm>=1.81.0; extra == "providers-litellm"
Requires-Dist: tenacity>=9.1.2; extra == "providers-litellm"
Provides-Extra: providers-vllm
Requires-Dist: vllm>=0.17.0; sys_platform == "linux" and extra == "providers-vllm"
Provides-Extra: docs
Requires-Dist: mkdocs>=1.6.0; extra == "docs"
Requires-Dist: mkdocs-material>=9.5.0; extra == "docs"
Requires-Dist: mkdocstrings[python]>=0.25.0; extra == "docs"
Provides-Extra: stats
Requires-Dist: numpy>=1.24.0; extra == "stats"
Requires-Dist: scipy>=1.10.0; extra == "stats"
Requires-Dist: pandas>=2.0.0; extra == "stats"
Provides-Extra: telemetry
Requires-Dist: langfuse>=3.0.0; extra == "telemetry"
Requires-Dist: wandb>=0.19.0; extra == "telemetry"
Provides-Extra: storage-postgres
Requires-Dist: psycopg[binary]>=3.2.0; extra == "storage-postgres"
Provides-Extra: all
Requires-Dist: themis-eval[compression,datasets,docs,extractors,math,providers-litellm,providers-openai,providers-vllm,stats,storage-postgres,telemetry]; extra == "all"
Dynamic: license-file

# Themis

> Benchmark-first orchestration for reproducible LLM evaluation.

[![CI](https://github.com/Pittawat2542/themis/actions/workflows/ci.yml/badge.svg)](https://github.com/Pittawat2542/themis/actions/workflows/ci.yml)
[![Docs](https://github.com/Pittawat2542/themis/actions/workflows/docs.yml/badge.svg)](https://github.com/Pittawat2542/themis/actions/workflows/docs.yml)
[![PyPI version](https://img.shields.io/pypi/v/themis-eval.svg)](https://pypi.org/project/themis-eval/)
[![Python 3.12+](https://img.shields.io/badge/python-3.12+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/license-MIT-yellow.svg)](LICENSE)

Themis now documents and supports one public authoring flow:

- `ProjectSpec` for shared storage and execution policy
- `BenchmarkSpec` for benchmark slices, prompt variants, parse pipelines, scores, and agent-style prompt flows
- `PluginRegistry` for engines, parsers, metrics, judges, and hooks
- `Orchestrator` for planning, execution, handoffs, and imports
- `BenchmarkResult` for aggregation, paired comparison, artifact bundles, and timelines
- `generate_config_report(...)` for reproducibility snapshots
- `themis-quickcheck` for fast SQLite inspection by slice and benchmark dimension

## Why Themis

- Benchmark-native authoring instead of experiment-matrix bookkeeping
- Query-aware dataset providers for subset, filter, and pushdown sampling
- Explicit prompt variants and parse pipelines instead of payload hacks
- Bootstrap prompt sequences, scripted follow-up turns, and first-class tool passing for agent-capable engines
- OpenAI-hosted MCP server support for remote tools during evaluation runs
- Projection-backed results with `slice_id`, `prompt_variant_id`, and semantic dimensions
- Local-first storage and deterministic reuse of completed work
- Seed-aware planning and per-candidate deterministic execution defaults

## Installation

```bash
uv add themis-eval
```

Add extras only when needed:

- `stats` for paired comparisons and richer report tooling
- `compression` for compressed artifact storage
- `extractors` for additional built-in parsing helpers
- `math` for math-equivalence scoring via `math-verify`
- `datasets` for dataset integrations
- `providers-openai`, `providers-litellm`, `providers-vllm` for provider SDKs
- `telemetry` for external observability callbacks
- `storage-postgres` for Postgres-backed storage
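
To see which packages each extra pulls in, the `Requires-Dist` entries in the package metadata can be grouped by their `extra` marker. The following is a minimal standard-library sketch, not part of Themis: `group_by_extra` is a hypothetical helper name, and the sample strings are copied from this package's metadata.

```python
import re
from collections import defaultdict

# Matches the `extra == "name"` marker inside a requirement string.
_EXTRA_MARKER = re.compile(r"""extra\s*==\s*["']([^"']+)["']""")


def group_by_extra(requirements):
    """Group requirement strings by the extra that enables them.

    Requirements without an extra marker are grouped under the key "".
    """
    grouped = defaultdict(list)
    for requirement in requirements:
        # Everything before the first ";" is the requirement itself.
        spec = requirement.split(";", 1)[0].strip()
        match = _EXTRA_MARKER.search(requirement)
        grouped[match.group(1) if match else ""].append(spec)
    return dict(grouped)


# Sample entries copied from the themis-eval metadata.
SAMPLE = [
    "rich>=13.9.0",
    'zstandard>=0.25.0; extra == "compression"',
    'numpy>=1.24.0; extra == "stats"',
    'scipy>=1.10.0; extra == "stats"',
    'vllm>=0.17.0; sys_platform == "linux" and extra == "providers-vllm"',
]

for extra, specs in sorted(group_by_extra(SAMPLE).items()):
    print(extra or "(core)", "->", ", ".join(specs))
```

With the package installed, `importlib.metadata.requires("themis-eval")` should return strings of the same shape, which can be passed to the helper instead of `SAMPLE`.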

## Quick Start

Start with a zero-friction smoke evaluation:

```bash
themis quick-eval inline \
  --model demo-model \
  --provider demo \
  --input "2 + 2" \
  --expected "4" \
  --format json
```

That writes a SQLite store under:

```text
.cache/themis/quick-eval/inline-demo-model-exact-match/themis.sqlite3
```
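
Because the store is a regular SQLite file, it can also be inspected with Python's standard library. The sketch below lists each table with its row count. It assumes nothing about the schema Themis writes, and `summarize_store` is a hypothetical helper name rather than a Themis function.

```python
import sqlite3
from pathlib import Path

# Path reported for the quick-eval run; adjust it for other runs.
STORE = Path(".cache/themis/quick-eval/inline-demo-model-exact-match/themis.sqlite3")


def summarize_store(path):
    """Return {table_name: row_count} for every table in a SQLite file."""
    # Open read-only so inspection cannot modify or create the store.
    connection = sqlite3.connect(f"file:{Path(path).as_posix()}?mode=ro", uri=True)
    try:
        names = [
            row[0]
            for row in connection.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        # Table names are read from sqlite_master, not from user input.
        return {
            name: connection.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
    finally:
        connection.close()


if __name__ == "__main__":
    if STORE.exists():
        for table, rows in summarize_store(STORE).items():
            print(f"{table}: {rows} rows")
    else:
        print(f"No store found at {STORE}")
```

For slice-level and benchmark-level views, `themis-quickcheck` remains the purpose-built tool; this sketch only confirms that the run wrote data.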

Initialize a real project scaffold when you want editable code and project files:

```bash
themis init starter-eval
```

Or start from a built-in benchmark definition:

```bash
themis quick-eval benchmark \
  --benchmark mmlu_pro \
  --model demo-model \
  --provider demo \
  --preview \
  --format json
```

To scaffold an editable project from the same benchmark:

```bash
themis init starter-mmlu --benchmark mmlu_pro
```

Math benchmarks are available as built-ins too:

```bash
themis quick-eval benchmark \
  --benchmark aime_2026 \
  --model demo-model \
  --provider demo \
  --preview \
  --format json
```

For the smallest code-first example, run the shipped hello-world benchmark:

```bash
uv run python examples/01_hello_world.py
```

Expected output:

```text
{'model_id': 'demo-model', 'slice_id': 'arithmetic', 'metric_id': 'exact_match', 'source': 'synthetic', 'prompt_variant_id': 'qa-default', 'mean': 1.0, 'count': 1}
```

That script shows the full benchmark-first loop:

- define a `DatasetProvider.scan(slice_spec, query)`
- register one engine and one metric
- build a `BenchmarkSpec`
- run `orchestrator.run_benchmark(...)`
- inspect the returned `BenchmarkResult`

The complete script is embedded in [docs/quick-start/index.md](docs/quick-start/index.md).
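
For readers without the repository checked out, the control flow of that loop can be sketched in plain Python. This is an illustration only: none of the classes or functions below are Themis APIs (`ToyDatasetProvider`, `toy_engine`, and `run_toy_benchmark` are hypothetical stand-ins), and the real script goes through `BenchmarkSpec` and `orchestrator.run_benchmark(...)` instead.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Item:
    """One dataset row: a prompt and its expected answer."""

    prompt: str
    expected: str


class ToyDatasetProvider:
    """Stand-in for a dataset provider; NOT the Themis class."""

    def scan(self, slice_spec, query):
        # A real provider would honor slice_spec and query; this one ignores both.
        return [Item(prompt="2 + 2", expected="4")]


def toy_engine(prompt):
    """Stand-in for a model engine: evaluates simple 'a + b' prompts."""
    left, right = prompt.split("+")
    return str(int(left) + int(right))


def exact_match(prediction, expected):
    """Score 1.0 when the stripped strings are identical, else 0.0."""
    return 1.0 if prediction.strip() == expected.strip() else 0.0


def run_toy_benchmark(provider, engine, metric, slice_id="arithmetic"):
    """Scan the slice, run the engine on each item, and aggregate the metric."""
    items = provider.scan(slice_spec=slice_id, query=None)
    scores = [metric(engine(item.prompt), item.expected) for item in items]
    return {
        "slice_id": slice_id,
        "metric_id": metric.__name__,
        "mean": mean(scores),
        "count": len(scores),
    }


print(run_toy_benchmark(ToyDatasetProvider(), toy_engine, exact_match))
```

The returned dictionary mirrors a few fields of the expected output row (`slice_id`, `metric_id`, `mean`, `count`); the remaining fields come from Themis itself and are not modeled here.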

## Examples

Runnable examples live in [`examples/`](examples/):

- `01_hello_world.py`
- `02_project_file.py`
- `03_custom_extractor_metric.py`
- `04_compare_models.py`
- `05_resume_run.py`
- `06_hooks_and_timeline.py`
- `07_judge_metric.py`
- `08_external_stage_handoff.py`
- `09_experiment_evolution.py`
- `10_agent_eval.py`
- `11_quick_benchmark.py`
- `12_iter_and_estimate.py`
- `13_catalog_builtin_benchmark.py`
- `14_mcp_openai.py`

`10_agent_eval.py` is the canonical advanced example for bootstrap prompts,
follow-up turns, tool declaration and selection, and returned agent traces.

`13_catalog_builtin_benchmark.py` is the catalog-specific example for running a
shipped built-in benchmark through `themis.catalog.build_catalog_benchmark_project(...)`
with a local fixture dataset loader.

`14_mcp_openai.py` shows the OpenAI-first MCP path for exposing a remote MCP
server to a benchmark run without using local `ToolSpec` handlers.

To discover all shipped built-in benchmark IDs from Python, use:

```python
from themis.catalog import list_catalog_benchmarks

print(list_catalog_benchmarks())
```

The canonical benchmark list and Python usage notes live in
[docs/guides/builtin-benchmarks.md](docs/guides/builtin-benchmarks.md).

`examples/medical_reasoning_eval` is intentionally left untouched as a handoff
reference. It is not the recommended public authoring pattern after the
benchmark-first redesign.

## Documentation

- Docs site: https://pittawat2542.github.io/themis/
- Quick Start: [docs/quick-start/index.md](docs/quick-start/index.md)
- Tutorials: [docs/tutorials/index.md](docs/tutorials/index.md)
- Concepts: [docs/concepts/index.md](docs/concepts/index.md)
- Guides: [docs/guides/index.md](docs/guides/index.md)
- API Reference: [docs/api-reference/index.md](docs/api-reference/index.md)
- FAQ: [docs/faq/index.md](docs/faq/index.md)

## Development

```bash
uv sync --all-extras --dev
uv run pytest
uv run mkdocs build --strict
uv run ruff check
```

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md).

## Citation

If you use Themis in research, cite via [`CITATION.cff`](CITATION.cff).

## License

MIT. See [LICENSE](LICENSE).
