Metadata-Version: 2.4
Name: themis-eval
Version: 2.1.0
Summary: Lightweight evaluation platform for LLM experiments
Author: Pittawat Taveekitworachai
License-Expression: MIT
Project-URL: Homepage, https://pittawat2542.github.io/themis/
Project-URL: Repository, https://github.com/Pittawat2542/themis
Project-URL: Documentation, https://pittawat2542.github.io/themis/
Project-URL: Changelog, https://github.com/Pittawat2542/themis/blob/main/CHANGELOG.md
Project-URL: Issues, https://github.com/Pittawat2542/themis/issues
Keywords: llm,evaluation,benchmark,experiments,ai
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.12
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pydantic>=2.12.5
Requires-Dist: rich>=13.9.0
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == "dev"
Requires-Dist: pytest-cov>=6.0.0; extra == "dev"
Requires-Dist: pytest-timeout>=2.3.1; extra == "dev"
Requires-Dist: pytest-asyncio>=0.24.0; extra == "dev"
Requires-Dist: ruff>=0.8.5; extra == "dev"
Requires-Dist: mypy>=1.14.0; extra == "dev"
Provides-Extra: compression
Requires-Dist: zstandard>=0.25.0; extra == "compression"
Provides-Extra: datasets
Requires-Dist: datasets>=2.20.0; extra == "datasets"
Provides-Extra: extractors
Requires-Dist: jsonschema>=4.0.0; extra == "extractors"
Provides-Extra: providers-openai
Requires-Dist: openai>=1.0.0; extra == "providers-openai"
Provides-Extra: providers-litellm
Requires-Dist: litellm>=1.81.0; extra == "providers-litellm"
Requires-Dist: tenacity>=9.1.2; extra == "providers-litellm"
Provides-Extra: providers-vllm
Requires-Dist: vllm>=0.17.0; sys_platform == "linux" and extra == "providers-vllm"
Provides-Extra: docs
Requires-Dist: mkdocs>=1.6.0; extra == "docs"
Requires-Dist: mkdocs-material>=9.5.0; extra == "docs"
Requires-Dist: mkdocstrings[python]>=0.25.0; extra == "docs"
Provides-Extra: stats
Requires-Dist: numpy>=1.24.0; extra == "stats"
Requires-Dist: scipy>=1.10.0; extra == "stats"
Requires-Dist: pandas>=2.0.0; extra == "stats"
Provides-Extra: telemetry
Requires-Dist: langfuse>=3.0.0; extra == "telemetry"
Requires-Dist: wandb>=0.19.0; extra == "telemetry"
Provides-Extra: storage-postgres
Requires-Dist: psycopg[binary]>=3.2.0; extra == "storage-postgres"
Provides-Extra: all
Requires-Dist: themis-eval[compression,datasets,docs,extractors,providers-litellm,providers-openai,providers-vllm,stats,storage-postgres,telemetry]; extra == "all"
Dynamic: license-file

# Themis

> Typed, code-first orchestration for reproducible LLM evaluation.

[![CI](https://github.com/Pittawat2542/themis/actions/workflows/ci.yml/badge.svg)](https://github.com/Pittawat2542/themis/actions/workflows/ci.yml)
[![Docs](https://github.com/Pittawat2542/themis/actions/workflows/docs.yml/badge.svg)](https://github.com/Pittawat2542/themis/actions/workflows/docs.yml)
[![PyPI version](https://img.shields.io/pypi/v/themis-eval.svg)](https://pypi.org/project/themis-eval/)
[![Python 3.12+](https://img.shields.io/badge/python-3.12+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/license-MIT-yellow.svg)](LICENSE)

Themis centers on a small public surface:

- `ProjectSpec` for shared storage and execution policy
- `ExperimentSpec` for the experiment matrix
- `PluginRegistry` for engines, extractors, metrics, judges, and hooks
- `Orchestrator` for planning, running, and materializing trials
- `ExperimentResult` for timelines, reports, and paired comparisons
- `generate_config_report(...)` for reproducibility-focused config snapshots
- `themis-quickcheck` for fast SQLite summary inspection

## Why Themis

- **Deterministic planning**: typed specs expand into stable trial hashes.
- **Local-first storage**: append-only events plus projection tables in SQLite.
- **Extensible runtime**: register your own engines, extractors, metrics, judges, and hooks.
- **Inspectable outputs**: read trials, timelines, reports, and paired comparisons from one result object.
- **Predictable resume behavior**: completed trials are skipped when storage, specs, and revision match.
- **Config-as-documentation reporting**: snapshot nested experiment parameters into JSON, YAML, Markdown, or LaTeX.
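
As a rough illustration of the deterministic-planning and resume ideas above, the sketch below hashes canonical JSON to get stable trial identifiers and skips trials whose hash is already recorded. It is not Themis's implementation: `trial_hash`, `plan`, `pending`, and the spec fields (`model`, `prompt`, `revision`) are hypothetical names chosen for the example, and the real hashing scheme may differ.

```python
import hashlib
import json


def trial_hash(spec: dict) -> str:
    """Hash a spec via canonical JSON so equal specs always hash equally."""
    # sort_keys + fixed separators make the serialization independent of
    # dict insertion order and whitespace.
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def plan(models, prompts, revision):
    """Expand a small matrix into (hash, spec) pairs in a stable order."""
    trials = []
    for model in sorted(models):
        for prompt in sorted(prompts):
            # Hypothetical spec shape; the revision is part of the identity.
            spec = {"model": model, "prompt": prompt, "revision": revision}
            trials.append((trial_hash(spec), spec))
    return trials


def pending(trials, completed_hashes):
    """Return only the trials whose hash has not been completed yet."""
    return [(h, s) for h, s in trials if h not in completed_hashes]


first = plan(["model-a", "model-b"], ["p1"], revision="r1")
again = plan(["model-b", "model-a"], ["p1"], revision="r1")  # same matrix, reordered
completed = {first[0][0]}  # pretend the first trial already finished
todo = pending(again, completed)
```

Because the revision is part of the hashed spec, changing it yields new hashes and nothing is skipped, which mirrors the "storage, specs, and revision match" condition in spirit.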

## Installation

```bash
uv add themis-eval

# add extras as needed
uv add "themis-eval[stats,compression]"
```

For the full optional-extra matrix, including `datasets`, provider SDKs,
telemetry, docs tooling, and the contributor toolchain, see
[docs/installation-setup/index.md](docs/installation-setup/index.md).

## Hello World

The runnable quick-start script lives at
[`examples/01_hello_world.py`](examples/01_hello_world.py). It uses local demo
components, so it runs without API keys or provider extras.

The workflow is:

- create a `PluginRegistry` with one fake engine and one metric
- build `ProjectSpec` and `ExperimentSpec` from top-level `themis` imports
- run `Orchestrator.from_project_spec(...)`
- inspect scores from the returned `ExperimentResult`

For the full script, use the example file directly or the
[Quick Start guide](docs/quick-start/index.md), which embeds that same file.

## Package Namespaces

The root `themis` package stays intentionally small. Additional convenience
namespaces are available when you want lower-level types without long module
paths:

- `themis.records` re-exports persisted record models such as `TrialRecord`
  and `CandidateRecord`
- `themis.types` re-exports shared enums and event/value types used across the
  runtime
- `themis.stats` re-exports paired-comparison tooling and requires the
  `stats` extra

These namespaces are lazy-loaded so the base install keeps a small import
surface and clear optional-dependency boundaries.
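
One common way to get this kind of lazy loading in Python is a module-level `__getattr__` (PEP 562). The sketch below shows the general technique only; it is an assumption about the approach, not a copy of how `themis` does it. The `demo` package and the name mapping are hypothetical, and standard-library modules stand in for the real submodules.

```python
import importlib
import types

# Hypothetical mapping: attribute name -> module that provides it.
# Standard-library modules stand in for real optional submodules.
LAZY_SUBMODULES = {"stats": "statistics", "records": "json"}


def make_lazy_package(name: str, lazy_map: dict) -> types.ModuleType:
    """Build a module object whose listed attributes are imported on first use."""
    pkg = types.ModuleType(name)

    def __getattr__(attr: str):
        # Only called when normal attribute lookup on the module fails.
        if attr not in lazy_map:
            raise AttributeError(f"module {name!r} has no attribute {attr!r}")
        module = importlib.import_module(lazy_map[attr])
        setattr(pkg, attr, module)  # cache so later lookups skip __getattr__
        return module

    # PEP 562: a function named __getattr__ in the module namespace is used
    # as the fallback for missing attributes.
    pkg.__getattr__ = __getattr__
    return pkg


demo = make_lazy_package("demo", LAZY_SUBMODULES)
loaded_before = "stats" in vars(demo)   # nothing imported yet
mean = demo.stats.mean([1, 2, 3])       # first access triggers the import
loaded_after = "stats" in vars(demo)    # now cached on the module
```

In a real package the `__getattr__` function would sit in `__init__.py`. With this pattern, a missing optional dependency raises an error only when the corresponding namespace is first accessed, not at `import` time.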

## Examples

Runnable examples live in [`examples/`](examples/):

- `01_hello_world.py`
- `02_project_file.py`
- `03_custom_extractor_metric.py`
- `04_compare_models.py`
- `05_resume_run.py`
- `06_hooks_and_timeline.py`
- `07_judge_metric.py`

## Config Reports

Use `generate_config_report(...)` when you need a human-readable snapshot of the
exact nested config used for an experiment:

```python
from pathlib import Path

from themis import generate_config_report

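# `project` and `experiment` are assumed to be spec objects you have already
# built (for example a ProjectSpec and an ExperimentSpec).
bundle = {"project": project, "experiment": experiment}

# Default-verbosity summary as Markdown.
markdown = generate_config_report(bundle, format="markdown")
# LaTeX output, with `output` pointing at a target file.
latex = generate_config_report(bundle, format="latex", output=Path("config-report.tex"))
# Exhaustive view as JSON.
full_json = generate_config_report(bundle, format="json", verbosity="full")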
```

The same collected structure can be rendered as `json`, `yaml`, `markdown`, or
`latex`, with `verbosity="default"` for a paper-facing summary and
`verbosity="full"` for the exhaustive view. Source metadata is best-effort for
dynamic or third-party classes. Custom formats can be registered with
`register_config_report_renderer(...)`. For CLI usage and a full worked example, see
[docs/guides/config-reports.md](docs/guides/config-reports.md).
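
To make the "one collected structure, several renderers" idea concrete, here is a minimal, self-contained sketch of a renderer registry over a nested dict. It does not use or reproduce the Themis API: `register_renderer`, `render`, `flatten`, and the snapshot contents are hypothetical stand-ins, and the real `register_config_report_renderer(...)` signature may differ.

```python
import json
from typing import Callable

Renderer = Callable[[dict], str]
_RENDERERS: dict[str, Renderer] = {}


def register_renderer(name: str, renderer: Renderer) -> None:
    """Register a callable that turns a config snapshot into text."""
    _RENDERERS[name] = renderer


def flatten(config: dict, prefix: str = "") -> list[tuple[str, object]]:
    """Flatten nested dicts into (dotted.path, value) rows, sorted by key."""
    rows: list[tuple[str, object]] = []
    for key in sorted(config):
        path = f"{prefix}.{key}" if prefix else key
        value = config[key]
        if isinstance(value, dict):
            rows.extend(flatten(value, path))
        else:
            rows.append((path, value))
    return rows


def render_markdown(config: dict) -> str:
    """Render the flattened snapshot as a two-column Markdown table."""
    lines = ["| Parameter | Value |", "| --- | --- |"]
    lines += [f"| `{path}` | {value} |" for path, value in flatten(config)]
    return "\n".join(lines)


def render(config: dict, format: str) -> str:
    """Dispatch to the renderer registered under `format`."""
    if format not in _RENDERERS:
        raise ValueError(f"unknown format: {format!r}")
    return _RENDERERS[format](config)


register_renderer("json", lambda config: json.dumps(config, indent=2, sort_keys=True))
register_renderer("markdown", render_markdown)

# Hypothetical snapshot standing in for a collected config structure.
snapshot = {"project": {"seed": 7}, "experiment": {"model": "demo", "temperature": 0.0}}
report = render(snapshot, "markdown")
```

Keeping collection separate from rendering means a new output format only needs one extra registered callable.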

## Documentation

- Docs site: https://pittawat2542.github.io/themis/
- Quick Start: [docs/quick-start/index.md](docs/quick-start/index.md)
- Concepts: [docs/concepts/index.md](docs/concepts/index.md)
- Guides: [docs/guides/index.md](docs/guides/index.md)
- Release Checklist: [docs/guides/releasing.md](docs/guides/releasing.md)
- API Reference: [docs/api-reference/index.md](docs/api-reference/index.md)
- FAQ: [docs/faq/index.md](docs/faq/index.md)

## Development

```bash
# install all dev + feature dependencies
uv sync --all-extras --dev

# test
uv run pytest

# strict docs build
uv run mkdocs build --strict

# baseline lint
uv run ruff check
```

## Contributing

Contributions are welcome. Start with [CONTRIBUTING.md](CONTRIBUTING.md).

## Citation

If you use Themis in research, please cite it using the metadata in [`CITATION.cff`](CITATION.cff).

## License

MIT. See [LICENSE](LICENSE).
