Metadata-Version: 2.4
Name: evallm-qa
Version: 0.1.1
Summary: QA framework for evaluating LLM outputs based on user-defined metrics
Project-URL: Repository, https://github.com/psandhaas/evaLLM
Project-URL: Documentation, https://psandhaas.github.io/evaLLM/
Author-email: Philipp Sandhaas <philipp.sandhaas@uni-potsdam.de>
License: MIT License
        
        Copyright (c) 2026 Philipp Sandhaas
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
License-File: LICENSE
Keywords: evaluation,llm,metrics,qa
Classifier: Development Status :: 3 - Alpha
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.12
Classifier: Typing :: Typed
Requires-Python: >=3.12
Requires-Dist: jsonschema[format-nongpl]>=4.26.0
Requires-Dist: langchain-core>=1.2.13
Requires-Dist: orjson>=3.11.7
Requires-Dist: pyyaml>=6.0.3
Requires-Dist: typer>=0.24.1
Provides-Extra: docs
Requires-Dist: griffe-fastapi>=0.1.6; extra == 'docs'
Requires-Dist: griffe-generics>=1.0.13; extra == 'docs'
Requires-Dist: griffe-inherited-docstrings>=1.1.2; extra == 'docs'
Requires-Dist: griffe-modernized-annotations>=1.0.8; extra == 'docs'
Requires-Dist: griffe-pydantic>=1.3.0; extra == 'docs'
Requires-Dist: markdown-pycon>=1.0.1; extra == 'docs'
Requires-Dist: markdown>=3.10.2; extra == 'docs'
Requires-Dist: mkdocs-api-autonav>=0.4.0; extra == 'docs'
Requires-Dist: mkdocs-git-revision-date-localized-plugin>=1.5.1; extra == 'docs'
Requires-Dist: mkdocs-glightbox>=0.5.2; extra == 'docs'
Requires-Dist: mkdocs-include-markdown-plugin>=7.2.1; extra == 'docs'
Requires-Dist: mkdocs-material>=9.7.1; extra == 'docs'
Requires-Dist: mkdocs-panzoom-plugin>=0.5.2; extra == 'docs'
Requires-Dist: mkdocstrings-python>=2.0.2; extra == 'docs'
Requires-Dist: pygments>=2.19.2; extra == 'docs'
Requires-Dist: pymdown-extensions>=10.21; extra == 'docs'
Provides-Extra: hf-tokenizers
Requires-Dist: huggingface-hub>=1.6.0; extra == 'hf-tokenizers'
Requires-Dist: tokenizers>=0.22.2; extra == 'hf-tokenizers'
Description-Content-Type: text/markdown

# evaLLM

evaLLM is a lightweight QA framework for evaluating LLM outputs with composable,
schema-validated metrics.

It is designed for two workflows:

- batch evaluation through a CLI
- embeddable evaluation pipelines through a Python API

## Quick links

- Documentation: https://psandhaas.github.io/evaLLM/
- Repository: https://github.com/psandhaas/evaLLM

## Installation

### Install from PyPI

```bash
pip install evallm-qa
```

### Install optional extras

Use this if you want encoding-based metrics (for example `count_encodings`):

```bash
pip install "evallm-qa[hf-tokenizers]"
```

### Install from source (secondary option)

Use this if you want to work from a local checkout:

```bash
git clone https://github.com/psandhaas/evaLLM.git
cd evaLLM
pip install .
```

Source install with extras:

```bash
pip install ".[hf-tokenizers]"
```

## What evaLLM evaluates

Current built-in metrics include:

- `CountsMetric`: character, token, or model-encoding counts
- `JsonFormatMetric`: field-level JSON format validation against a JSON Schema

Results are emitted per input record as JSON objects, which makes them easy to
pipe, persist, and analyze downstream.
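
As a rough illustration of that downstream analysis, the sketch below averages one numeric field over JSONL result lines using only the standard library. The field names (`record_id`, `chars`) and the helper `mean_of_field` are assumptions made for this example; the actual fields depend on the metrics and `result_key` values you configure.

```python
import io
import json
from statistics import mean


def mean_of_field(lines, key):
    """Average a numeric field over JSONL lines, skipping records that lack it."""
    values = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        value = json.loads(line).get(key)
        # bool is a subclass of int, so exclude it explicitly
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            values.append(value)
    return mean(values) if values else None


# Hypothetical result records; real output fields depend on your configuration.
sample = io.StringIO(
    '{"record_id": "a", "chars": 10}\n'
    '{"record_id": "b", "chars": 20}\n'
)
average_chars = mean_of_field(sample, "chars")
print(average_chars)
```

The same function accepts an open file handle, for example one pointing at a results file written with `--output`.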

## Library structure

evaLLM follows a layered architecture to keep concerns separated:

- `evallm.application`
    - orchestration layer
    - DTOs (`EvaluationRequest`, `MetricSpec`, dataset specs)
    - request resolution and evaluator execution
- `evallm.metrics`
    - base metric interface (`BaseMetric[T]`)
    - built-in metrics and registration catalog
    - metric composition (`CompositeMetric`)
- `evallm.readers`
    - dataset access layer (currently JSONL reader)
- `evallm.cli`
    - Typer-based command-line interface

Further details on architectural choices can be found in the [documentation](https://psandhaas.github.io/evaLLM/architecture/#application-layer).

## Composite metrics

evaLLM intentionally models multi-metric evaluation using the Composite design pattern.

Composite is a good fit because each metric is an independent computation over
the same input text, and the framework must aggregate all metric outputs into one
result object. This creates a tree-shaped execution model:

- a parent metric container evaluates children
- each child contributes a namespaced result
- multiple instances of the same metric class can coexist with different configs

This gives you two concrete benefits:

- as a user, you can rely on `evaluate(text)` doing exactly what it says: your
    input is evaluated against every metric you selected, each with the
    configuration you chose, and
- as a developer, you only need to implement `evaluate(text)`, without having to
    consider how or when your metric is called.
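
The pattern itself can be sketched in a few lines of plain Python. This is a generic illustration of Composite, not evaLLM's actual `CompositeMetric` implementation; the classes `Metric`, `CharCount`, `TokenCount`, and `Composite` are hypothetical names invented for this example.

```python
from abc import ABC, abstractmethod


class Metric(ABC):
    """Hypothetical leaf/composite interface: one method, one namespaced result."""

    def __init__(self, result_key):
        self.result_key = result_key

    @abstractmethod
    def evaluate(self, text):
        """Return a dict mapping this metric's result key to its value."""


class CharCount(Metric):
    def evaluate(self, text):
        return {self.result_key: len(text)}


class TokenCount(Metric):
    def __init__(self, result_key, sep=None):
        super().__init__(result_key)
        self.sep = sep  # None means "split on any whitespace"

    def evaluate(self, text):
        return {self.result_key: len(text.split(self.sep))}


class Composite(Metric):
    """Parent container: evaluates every child and merges their results."""

    def __init__(self, children):
        super().__init__(result_key="composite")
        self.children = list(children)

    def evaluate(self, text):
        merged = {}
        for child in self.children:
            merged.update(child.evaluate(text))
        return merged


# Two instances of the same class coexist under different result keys.
composite = Composite(
    [CharCount("chars"), TokenCount("tokens_ws"), TokenCount("tokens_comma", sep=",")]
)
result = composite.evaluate("a b c,d")
print(result)
```

Because the container exposes the same `evaluate(text)` method as its children, callers do not need to know whether they hold one metric or many.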

## Core concepts

### Evaluation request

An evaluation is defined by:

- one dataset spec (`dataset`)
- one or more metric specs (`metrics`)

### Metric spec

Each metric spec supports:

- `name`: registered metric identifier
- `config` (optional): runtime metric config
- `result_key` (optional): output field name override

### Dataset support

Current support:

- JSONL datasets (`JsonlDatasetSpec`)

Each JSONL record must contain a text field and can optionally provide a stable
record id field.
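
A dataset in this shape can be produced with the standard library alone. The sketch below writes two records and reads them back; the key names `text` and `id` mirror the ones used in the examples in this README, and the record contents are made up for illustration.

```python
import json
import tempfile
from pathlib import Path

# Made-up records: one text field plus an optional stable id per record.
records = [
    {"id": "rec-1", "text": "The capital of France is Paris."},
    {"id": "rec-2", "text": '{"answer": "Paris", "confidence": 0.9}'},
]

path = Path(tempfile.mkdtemp()) / "dataset.jsonl"

# JSONL: exactly one JSON object per line.
with path.open("w", encoding="utf-8") as fh:
    for record in records:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")

# Read the file back to confirm that every line parses on its own.
with path.open(encoding="utf-8") as fh:
    loaded = [json.loads(line) for line in fh if line.strip()]

print(len(loaded), "records written to", path)
```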

Extend [`BaseReader`](https://psandhaas.github.io/evaLLM/reference/evallm/readers/#evallm.readers.BaseReader) and [`BaseDatasetSpec`](https://psandhaas.github.io/evaLLM/reference/evallm/application/models/#evallm.application.models.BaseDatasetSpec) for further formats.

## Python API examples

### 1) Build and run an evaluation with helper presets

```python
from evallm.application import (
    JsonlEvaluationBuilder,
    count_characters,
    count_tokens,
    json_format,
)

builder = JsonlEvaluationBuilder(
    "./data/dataset.jsonl",
    text_key="text",
    record_id_key="id",
)

builder.use(count_characters(result_key="chars"))
builder.use(count_tokens(segmentation="wordpunct", result_key="tokens_wordpunct"))
builder.use(
    json_format(
        {
            "type": "object",
            "properties": {
                "answer": {"type": "string"},
                "confidence": {"type": "number", "minimum": 0.0, "maximum": 1.0},
            },
            "required": ["answer"],
        },
        result_key="format_ok",
    )
)

for result in builder.run():
    print(result.model_dump())
```

### 2) Build a request explicitly

```python
from evallm.application import EvaluationRequest

request = EvaluationRequest.create(
    obj={
        "dataset": {
            "input_file": "./data/dataset.jsonl",
            "text_key": "text",
            "record_id_key": "id",
        },
        "metrics": [
            {"name": "CountsMetric", "result_key": "chars"},
            {
                "name": "CountsMetric",
                "config": {"segments": "tokens", "segmentation": "whitespace"},
                "result_key": "tokens",
            },
        ],
    }
)
```

## CLI examples

### Evaluate via flags

```bash
evallm run \
    --input-file ./data/dataset.jsonl \
    --text-key text \
    --record-id-key id \
    --metric CountsMetric
```

### Evaluate via config file

```bash
evallm run --config ./run.yaml
```

### Discover available metrics and presets

```bash
evallm metrics list
evallm metrics info CountsMetric --format yaml
evallm metrics presets
```

### Write JSONL results to a file

```bash
evallm run \
    -f ./data/dataset.jsonl \
    -t text \
    -i id \
    -m CountsMetric \
    --output ./results.jsonl
```

## Config file reference

The config file passed to `evallm run --config` expects two top-level sections:

1. `dataset`
2. `metrics`

Minimal YAML example:

```yaml
dataset:
    input_file: ./data/dataset.jsonl
    text_key: text
    record_id_key: id

metrics:
    - name: CountsMetric
      result_key: chars

    - name: CountsMetric
      config:
          segments: tokens
          segmentation: whitespace
      result_key: tokens

    - name: JsonFormatMetric
      config:
          expected_schema:
              type: object
              properties:
                  answer:
                      type: string
                  confidence:
                      type: number
                      minimum: 0.0
                      maximum: 1.0
              required:
                  - answer
          check_formats: true
      result_key: structured_output
```

Important merge rules:

- if `--config` and inline flags are both provided, inline dataset flags override
    config fields
- if any `--metric` flags are provided, they replace `config.metrics`
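
The merge behaviour described above can be modelled as follows. This is a minimal sketch of the stated rules, not the CLI's actual code: the function `merge_run_options` is hypothetical, and it assumes that an unset flag is represented as `None` and that each `--metric` flag maps to a spec containing only a `name`.

```python
def merge_run_options(config, dataset_flags, metric_flags):
    """Combine a parsed config file with inline CLI flags (hypothetical helper)."""
    # Rule 1: inline dataset flags override config fields, key by key.
    dataset = dict(config.get("dataset", {}))
    dataset.update({k: v for k, v in dataset_flags.items() if v is not None})

    # Rule 2: any --metric flags replace the config's metric list entirely.
    if metric_flags:
        metrics = [{"name": name} for name in metric_flags]
    else:
        metrics = list(config.get("metrics", []))

    return {"dataset": dataset, "metrics": metrics}


config = {
    "dataset": {"input_file": "a.jsonl", "text_key": "text"},
    "metrics": [{"name": "JsonFormatMetric"}],
}
merged = merge_run_options(
    config,
    dataset_flags={"text_key": "body", "record_id_key": None},
    metric_flags=["CountsMetric"],
)
print(merged)
```

Note the asymmetry: dataset fields are merged individually, while metrics are replaced as a whole list.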

## Extending evaLLM with custom metrics

Implement `BaseMetric[T]` and register with `@register_metric`.

```python
from evallm.metrics.base import BaseMetric
from evallm.metrics.registry import register_metric


@register_metric(name="token_lengths")
class TokenLengthsMetric(BaseMetric[list[int]]):
    def evaluate(self, text: str) -> dict[str, list[int]]:
        if text.strip() == "":
            return self.result([])
        return self.result([len(tok) for tok in text.split()])
```

## Development

Run tests:

```bash
uv run pytest
```

Without `uv`:

```bash
python -m pytest
```

Run docs locally:

```bash
uv run mkdocs serve
```

Without `uv`:

```bash
mkdocs serve
```

If you want to use `uv`, installation instructions are available at
https://docs.astral.sh/uv/getting-started/installation/.
