Metadata-Version: 2.4
Name: themis-eval
Version: 1.4.0
Summary: Lightweight evaluation platform for LLM experiments
Author: Pittawat Taveekitworachai
License: MIT
Project-URL: Homepage, https://pittawat2542.github.io/themis/
Project-URL: Repository, https://github.com/Pittawat2542/themis
Project-URL: Documentation, https://pittawat2542.github.io/themis/
Project-URL: Changelog, https://github.com/Pittawat2542/themis/blob/main/CHANGELOG.md
Project-URL: Issues, https://github.com/Pittawat2542/themis/issues
Keywords: llm,evaluation,benchmark,experiments,ai
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.12
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pydantic>=2.12.5
Requires-Dist: cyclopts>=4.0.0
Requires-Dist: hydra-core>=1.3
Requires-Dist: tqdm>=4.67
Requires-Dist: litellm>=1.81.0
Requires-Dist: tabulate>=0.9.0
Requires-Dist: tenacity>=9.1.2
Requires-Dist: rich>=14.2.0
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == "dev"
Requires-Dist: pytest-cov>=6.0.0; extra == "dev"
Requires-Dist: pytest-timeout>=2.3.1; extra == "dev"
Requires-Dist: pytest-asyncio>=0.24.0; extra == "dev"
Requires-Dist: ruff>=0.8.5; extra == "dev"
Requires-Dist: mypy>=1.14.0; extra == "dev"
Provides-Extra: math
Requires-Dist: datasets>=2.20.0; extra == "math"
Requires-Dist: math-verify>=0.8.0; extra == "math"
Provides-Extra: nlp
Requires-Dist: sacrebleu>=2.4.0; extra == "nlp"
Requires-Dist: rouge-score>=0.1.2; extra == "nlp"
Requires-Dist: bert-score>=0.3.13; extra == "nlp"
Requires-Dist: nltk>=3.8.0; extra == "nlp"
Provides-Extra: code
Requires-Dist: codebleu>=0.7.0; extra == "code"
Provides-Extra: viz
Requires-Dist: plotly>=5.18.0; extra == "viz"
Provides-Extra: langfuse
Requires-Dist: langfuse>=3.0.0; extra == "langfuse"
Provides-Extra: server
Requires-Dist: fastapi>=0.128.0; extra == "server"
Requires-Dist: uvicorn[standard]>=0.32.0; extra == "server"
Requires-Dist: websockets>=14.0; extra == "server"
Provides-Extra: docs
Requires-Dist: mkdocs>=1.6.0; extra == "docs"
Requires-Dist: mkdocs-material>=9.5.0; extra == "docs"
Requires-Dist: mkdocstrings[python]>=0.25.0; extra == "docs"
Provides-Extra: all
Requires-Dist: themis-eval[code,docs,langfuse,math,nlp,server,viz]; extra == "all"
Dynamic: license-file

# Themis

> Lightweight, practical evaluation workflows for LLM experiments.

[![CI](https://github.com/Pittawat2542/themis/actions/workflows/ci.yml/badge.svg)](https://github.com/Pittawat2542/themis/actions/workflows/ci.yml)
[![Docs](https://github.com/Pittawat2542/themis/actions/workflows/docs.yml/badge.svg)](https://github.com/Pittawat2542/themis/actions/workflows/docs.yml)
[![PyPI version](https://img.shields.io/pypi/v/themis-eval.svg)](https://pypi.org/project/themis-eval/)
[![Python 3.12+](https://img.shields.io/badge/python-3.12+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/license-MIT-yellow.svg)](LICENSE)

Themis gives you two clean entry points:

- `themis.evaluate(...)` for quick benchmark and custom dataset evaluation.
- `ExperimentSession().run(...)` for explicit, step-by-step evaluation pipelines.

It includes built-in benchmarks, auto-configured metric pipelines, caching and resumability, statistical comparison utilities, and a web server for run inspection.

## Why Themis

- **Fast start**: Run your first evaluation in a few lines.
- **Structured control**: `ExperimentSession` API for reproducible workflows.
- **Built-in presets**: Curated benchmark definitions with prompts, metrics, and extractors included.
- **Extensible**: Register your own datasets, custom metrics, LLM providers, and benchmark presets.
- **Practical storage**: Local cache, resumable runs, and a robust storage backend.
- **Production-minded CI/CD**: Strict docs build, package validation, and release automation.

## Installation

```bash
# stable release
uv add themis-eval

# with optional extras
uv add "themis-eval[math,nlp,code,server]"
```

## Quick Start (No API key required)

Use the built-in `fake` model provider with the `demo` benchmark to verify that everything is installed correctly:

```python
from themis import evaluate

report = evaluate(
    "demo",
    model="fake:fake-math-llm", # Uses the built-in fake provider
    limit=10,
)

exact_match = report.evaluation_report.metrics["ExactMatch"]
print(f"ExactMatch: {exact_match.mean:.2%} (n={exact_match.count})")
```
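
To make the `mean` and `count` fields printed above more concrete, here is a minimal standalone sketch of how an exact-match score could be aggregated. It does not use Themis: `MetricSummary` and `exact_match_summary` are hypothetical names, and the whitespace and case normalization is an assumption that may differ from what Themis's `ExactMatch` actually does.

```python
from dataclasses import dataclass


@dataclass
class MetricSummary:
    """Hypothetical stand-in for an aggregated metric (not a Themis class)."""

    mean: float
    count: int


def exact_match_summary(predictions: list[str], references: list[str]) -> MetricSummary:
    """Score 1.0 per sample when the normalized strings are equal, then average."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    scores = [
        # Assumed normalization: strip surrounding whitespace and ignore case.
        1.0 if pred.strip().lower() == ref.strip().lower() else 0.0
        for pred, ref in zip(predictions, references)
    ]
    mean = sum(scores) / len(scores) if scores else 0.0
    return MetricSummary(mean=mean, count=len(scores))


summary = exact_match_summary(["42", " 7 ", "12"], ["42", "7", "13"])
print(f"ExactMatch: {summary.mean:.2%} (n={summary.count})")
```

Two of the three toy predictions match their references, so the reported mean is two thirds over three samples.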

## Quick Start (Real models)

Evaluating real models requires the corresponding provider's API key (e.g., `OPENAI_API_KEY`). By default, Themis uses [LiteLLM](https://github.com/BerriAI/litellm) for robust multi-provider routing.

```python
import os
from themis import evaluate

os.environ["OPENAI_API_KEY"] = "sk-..."

# Run the GSM8K math benchmark with GPT-4o
report = evaluate(
    "gsm8k",
    model="openai/gpt-4o", # Model string parsed by LiteLLM
    limit=100,
)

# Extract and print the aggregated metric score
accuracy = report.evaluation_report.metrics["ExactMatch"].mean
print(f"Accuracy: {accuracy:.2%}")
```

## CLI Workflow

```bash
# Run two experiments
themis eval gsm8k --model gpt-4 --limit 100 --run-id run-a
themis eval gsm8k --model gpt-4 --temperature 0.7 --limit 100 --run-id run-b

# Compare them
themis compare run-a run-b

# Explore in browser
themis serve --storage .cache/experiments
```
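
For intuition about what a statistical comparison of two runs involves, the following is a minimal standalone sketch of a paired sign-flip permutation test on per-sample scores. It is not necessarily the method `themis compare` uses; the function name, the toy scores, and the defaults are all assumptions made for illustration.

```python
import random


def paired_permutation_test(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Two-sided sign-flip permutation test on the mean per-sample score difference."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs) / len(diffs))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        # Under the null hypothesis, the sign of each paired difference is arbitrary.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped / len(diffs)) >= observed - 1e-12:
            hits += 1
    # Add-one smoothing keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_resamples + 1)


# Toy per-sample correctness (1 = correct) for the same ten items in two runs.
run_a = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
run_b = [1, 0, 0, 1, 0, 0, 1, 1, 0, 0]
p_value = paired_permutation_test(run_a, run_b)
print(f"mean A={sum(run_a) / len(run_a):.2f}, mean B={sum(run_b) / len(run_b):.2f}, p={p_value:.3f}")
```

Pairing matters because both runs answer the same items: the test only looks at samples where the two runs disagree, which is usually more sensitive than comparing the two means as if they came from independent samples.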

Helpful commands:

```bash
themis list benchmarks
themis list runs --storage .cache/experiments
themis list metrics
```


## Built-in Coverage

Themis ships with math, reasoning, science, and QA presets (for example: `gsm8k`, `math500`, `aime24`, `aime25`, `mmlu-pro`, `supergpqa`, `gpqa`, `commonsense_qa`, `coqa`, `demo`).

List everything from the CLI:

```bash
themis list benchmarks
```

Supported metric families include:

- exact/verification metrics (for math/structured outputs)
- NLP metrics (`BLEU`, `ROUGE`, `BERTScore`, `METEOR`)
- code metrics (`PassAtK`, `CodeBLEU`, execution-based checks)
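
As background for the code metrics, pass@k is commonly estimated with the unbiased formula $1 - \binom{n-c}{k} / \binom{n}{k}$, where $n$ samples were generated per problem and $c$ of them passed. The sketch below implements that formula in plain Python; it is an illustration only, and whether Themis's `PassAtK` uses the same estimator is an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem as 1 - C(n - c, k) / C(n, k).

    n: samples generated, c: samples that passed, k: attempt budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples failed, giving pass@k = 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


for k in (1, 5, 10):
    print(f"pass@{k} = {pass_at_k(n=10, c=3, k=k):.4f}")
```

For a benchmark, the per-problem estimates are typically averaged across all problems.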

## Extending Themis

Top-level extension APIs are available directly from `themis`:

```python
import themis

# themis.register_metric(name, metric_cls)
# themis.register_dataset(name, factory)
# themis.register_provider(name, factory)
# themis.register_benchmark(preset)
```
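
The signatures above do not show what a metric class has to look like. As a purely illustrative sketch, the class below assumes a hypothetical `compute(prediction, reference)` method that returns a float. The class name, the method name, and the return type are all assumptions, so check the extension guide for the interface Themis actually expects before registering anything.

```python
class ContainsAnswer:
    """Hypothetical custom metric: 1.0 if the reference appears in the prediction.

    The `compute` interface is an assumption for illustration only,
    not the documented Themis metric protocol.
    """

    name = "contains_answer"

    def compute(self, prediction: str, reference: str) -> float:
        # Case-insensitive substring check on whitespace-trimmed strings.
        return 1.0 if reference.strip().lower() in prediction.strip().lower() else 0.0


metric = ContainsAnswer()
print(metric.compute("The answer is 42.", "42"))
print(metric.compute("I am not sure.", "42"))

# Registration would then follow the signature listed above (left commented out
# because the required metric interface is assumed here):
# themis.register_metric("contains_answer", ContainsAnswer)
```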

See the extension guides:

- [Extending Themis](https://pittawat2542.github.io/themis/)
- [API Backends Reference](docs/api/backends.md)

## Documentation

- Docs site: https://pittawat2542.github.io/themis/
- Getting started: [docs/getting-started/quickstart.md](docs/getting-started/quickstart.md)
- Evaluation guide: [docs/guides/evaluation.md](docs/guides/evaluation.md)
- Comparison guide: [docs/guides/comparison.md](docs/guides/comparison.md)
- CI/CD and release process: [docs/guides/ci-cd.md](docs/guides/ci-cd.md)

## Examples

Runnable examples live in [`examples/`](examples/):

- `01_quickstart.py`
- `02_custom_dataset.py`
- `04_comparison.py`
- `05_api_server.py`
- `07_provider_ready.py`
- `08_resume_cache.py`
- `09_research_loop.py`

Run one:

```bash
uv run python examples/01_quickstart.py
```

## Development

```bash
# install all dev + feature dependencies
uv sync --all-extras --dev

# test
uv run pytest

# strict docs build
uv run mkdocs build --strict

# baseline syntax/runtime lint used in CI
uv run ruff check --select E9,F63,F7 themis tests
```

## Contributing

Contributions are welcome. Start with [CONTRIBUTING.md](CONTRIBUTING.md).

## Citation

If you use Themis in research, cite via [`CITATION.cff`](CITATION.cff).

## License

MIT. See [LICENSE](LICENSE).
